The dataset used in this project contains the following information about 205 cars:
All the variables are measured in US customary units (miles, dollars, inches, etc.).
skimr::skim(data)

| Name | data |
|---|---|
| Number of rows | 205 |
| Number of columns | 25 |
| Column type frequency: | |
| factor | 11 |
| numeric | 14 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| symboling | 0 | 1.00 | FALSE | 6 | 0: 67, 1: 54, 2: 32, 3: 27 |
| make | 0 | 1.00 | FALSE | 22 | toy: 32, nis: 18, maz: 17, hon: 13 |
| fuel_type | 0 | 1.00 | FALSE | 2 | gas: 185, die: 20 |
| aspiration | 0 | 1.00 | FALSE | 2 | std: 168, tur: 37 |
| num_of_doors | 2 | 0.99 | FALSE | 2 | fou: 114, two: 89 |
| body_style | 0 | 1.00 | FALSE | 5 | sed: 96, hat: 70, wag: 25, har: 8 |
| drive_wheels | 0 | 1.00 | FALSE | 3 | fwd: 120, rwd: 76, 4wd: 9 |
| engine_location | 0 | 1.00 | FALSE | 2 | fro: 202, rea: 3 |
| engine_type | 0 | 1.00 | FALSE | 7 | ohc: 148, ohcf: 15, ohcv: 13, dohc: 12 |
| num_of_cylinders | 0 | 1.00 | FALSE | 7 | fou: 159, six: 24, fiv: 11, eig: 5 |
| fuel_system | 0 | 1.00 | FALSE | 8 | mpf: 94, 2bb: 66, idi: 20, 1bb: 11 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| wheel_base | 0 | 1.00 | 98.76 | 6.02 | 86.60 | 94.50 | 97.00 | 102.40 | 120.90 | ▁▇▂▁▁ |
| length | 0 | 1.00 | 174.05 | 12.34 | 141.10 | 166.30 | 173.20 | 183.10 | 208.10 | ▁▅▇▃▁ |
| width | 0 | 1.00 | 65.91 | 2.15 | 60.30 | 64.10 | 65.50 | 66.90 | 72.30 | ▁▇▇▂▁ |
| height | 0 | 1.00 | 53.72 | 2.44 | 47.80 | 52.00 | 54.10 | 55.50 | 59.80 | ▁▆▇▆▂ |
| curb_weight | 0 | 1.00 | 2555.57 | 520.68 | 1488.00 | 2145.00 | 2414.00 | 2935.00 | 4066.00 | ▃▇▅▃▁ |
| engine_size | 0 | 1.00 | 126.91 | 41.64 | 61.00 | 97.00 | 120.00 | 141.00 | 326.00 | ▇▆▂▁▁ |
| bore | 4 | 0.98 | 3.33 | 0.27 | 2.54 | 3.15 | 3.31 | 3.59 | 3.94 | ▁▅▇▇▂ |
| stroke | 4 | 0.98 | 3.26 | 0.32 | 2.07 | 3.11 | 3.29 | 3.41 | 4.17 | ▁▂▇▇▁ |
| compressio_ratio | 0 | 1.00 | 10.14 | 3.97 | 7.00 | 8.60 | 9.00 | 9.40 | 23.00 | ▇▁▁▁▁ |
| horsepower | 2 | 0.99 | 104.26 | 39.71 | 48.00 | 70.00 | 95.00 | 116.00 | 288.00 | ▇▅▂▁▁ |
| peak_rpm | 2 | 0.99 | 5125.37 | 479.33 | 4150.00 | 4800.00 | 5200.00 | 5500.00 | 6600.00 | ▂▇▇▂▁ |
| city_mpg | 0 | 1.00 | 17.12 | 0.68 | 17.00 | 17.00 | 17.00 | 17.00 | 24.00 | ▇▁▁▁▁ |
| highway_mpg | 0 | 1.00 | 30.75 | 6.89 | 16.00 | 25.00 | 30.00 | 34.00 | 54.00 | ▂▇▆▁▁ |
| price | 4 | 0.98 | 13207.13 | 7947.07 | 5118.00 | 7775.00 | 10295.00 | 16500.00 | 45400.00 | ▇▃▁▁▁ |
In the previous summary of the data, we can see big differences in scale among the numerical variables, as the data is not normalized. Therefore, we can expect the variances to differ widely merely because of the units of measurement. This has to be taken into account before carrying out any dimension-reduction analysis.
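Since a variance grows with the square of the measurement unit, standardizing the variables (for instance with `scale()`) puts them on a common footing before any dimension reduction. A toy illustration, reusing the width quantiles from the summary above as made-up data:

```r
# Toy illustration: variance scales with the square of the unit-conversion
# factor, so unscaled variables like price would dominate a PCA.
x_inches <- c(60.3, 64.1, 65.5, 66.9, 72.3)  # width quantiles, in inches
x_mm <- x_inches * 25.4                      # the same widths in millimetres

var(x_mm) / var(x_inches)                    # 25.4^2 = 645.16

# Standardizing removes the unit effect: every variable gets variance 1.
var(as.vector(scale(x_inches)))              # 1
var(as.vector(scale(x_mm)))                  # 1
```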
Moreover, there are a few missing values in the following variables: num_of_doors, bore, stroke, horsepower, peak_rpm and price. As this absence may hinder the study, we will impute these variables using predictive mean matching.
Predictive mean matching completes each missing observation by fitting a linear regression of the incomplete variable on the remaining variables and then borrowing the observed value of the donor whose predicted mean is closest. The regression coefficients are estimated from the covariance matrix.
\[ x_j = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon_j \]
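The matching step can be sketched by hand on toy data. This is a simplified, single-donor version of what `mice` does; the real algorithm also draws the coefficients from their posterior and picks among several closest donors:

```r
# Toy sketch of predictive mean matching (simplified): fit the regression on
# the complete cases, then replace the NA with the *observed* value of the
# donor whose predicted mean is closest.
set.seed(1)
x <- 1:10
y <- 2 * x + rnorm(10)
y[4] <- NA                                  # one artificial missing value

fit <- lm(y ~ x, subset = !is.na(y))        # regression on complete cases
preds <- predict(fit, newdata = data.frame(x = x))

donors <- which(!is.na(y))                  # candidate donors
nearest <- donors[which.min(abs(preds[donors] - preds[4]))]
y[4] <- y[nearest]                          # borrow the donor's observed value
```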
library(mice)  # provides mice() and complete() for the imputation
data_imp = mice(data, method = "pmm", seed = 100407485)
## iter imp variable
## 1 1 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 1 2 num_of_doors bore* stroke* horsepower* peak_rpm* price*
## 1 3 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 1 4 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 1 5 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 2 1 num_of_doors* bore* stroke* horsepower peak_rpm* price*
## 2 2 num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 2 3 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 2 4 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 2 5 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 1 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 2 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 3 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 4 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 3 5 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 4 1 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 4 2 num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 4 3 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 4 4 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 4 5 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 5 1 num_of_doors* bore stroke* horsepower* peak_rpm* price*
## 5 2 num_of_doors* bore* stroke horsepower* peak_rpm* price*
## 5 3 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 5 4 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
## 5 5 num_of_doors* bore* stroke* horsepower* peak_rpm* price*
data_imp = complete(data_imp)
data <- data_imp

The main characteristics of the quantitative variables are:
# Mean vector
m = colMeans(data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "city_mpg", "highway_mpg", "price")], na.rm = T)

\[ \overline{x} = (98.76, 174.05, 65.91, 53.72, 2555.57, 126.91, 3.33, 3.26, 10.14, 104.26, 5125.37, 17.12, 30.75, 13207.13) \]
That is to say, the mean measurements of the cars in this data set are: a wheel base of 98.76 inches, a length of 174.05 inches, a width of 65.91 inches, a height of 53.72 inches and a curb weight of 2555.57 pounds.
These cars also have a mean engine size of 126.91 cubic inches, while the mean bore and stroke are 3.33 and 3.26 inches respectively. The mean compression ratio is 10.14. In addition, the vehicles have a mean horsepower of 104.26, a mean of 5125.37 peak revolutions per minute, and mean fuel consumptions of 17.12 and 30.75 miles per gallon in the city and on the highway respectively.
These characteristics place the mean price of the cars at around $13,207.13.
Before calculating the covariance and correlation matrices, we eliminate the city_mpg variable, as it is almost constant and interferes with the study.
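As a side note, near-constant columns can also be detected programmatically, for instance by their standard deviation. A toy sketch with made-up data and a hypothetical threshold:

```r
# Toy sketch: flag near-constant numeric columns by a (hypothetical) minimum
# standard deviation before any correlation-based analysis.
df <- data.frame(a = c(17, 17, 17, 17, 17.5),  # almost constant, like city_mpg
                 b = c(16, 25, 30, 34, 54))    # genuinely variable
sapply(df, sd)
keep <- sapply(df, sd) > 1                     # hypothetical threshold
df <- df[, keep, drop = FALSE]
names(df)                                      # only "b" survives
```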
data <- data[,!names(data) %in% c("city_mpg")]

# Covariance matrix
S = var(data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")], na.rm = T)
round(S, 2)

## wheel_base length width height curb_weight engine_size
## wheel_base 36.26 64.98 10.27 8.67 2434.30 142.77
## length 64.98 152.21 22.26 14.80 5638.34 351.08
## width 10.27 22.26 4.60 1.46 968.45 65.70
## height 8.67 14.80 1.46 5.97 376.05 6.83
## curb_weight 2434.30 5638.34 968.45 376.05 271107.87 18443.03
## engine_size 142.77 351.08 65.70 6.83 18443.03 1734.11
## bore 0.83 2.08 0.33 0.16 92.90 7.15
## stroke 0.35 0.56 0.13 0.01 29.30 3.31
## compressio_ratio 5.97 7.76 1.54 2.54 313.04 4.79
## horsepower 85.17 267.56 54.10 -10.22 15470.60 1331.47
## peak_rpm -1010.63 -1737.84 -231.10 -365.30 -65787.45 -4907.10
## highway_mpg -22.56 -59.87 -10.00 -1.81 -2859.42 -194.28
## price 27862.47 67150.39 12163.95 2840.08 3358912.33 281913.60
## bore stroke compressio_ratio horsepower peak_rpm
## wheel_base 0.83 0.35 5.97 85.17 -1010.63
## length 2.08 0.56 7.76 267.56 -1737.84
## width 0.33 0.13 1.54 54.10 -231.10
## height 0.16 0.01 2.54 -10.22 -365.30
## curb_weight 92.90 29.30 313.04 15470.60 -65787.45
## engine_size 7.15 3.31 4.79 1331.47 -4907.10
## bore 0.08 0.00 0.01 6.01 -43.43
## stroke 0.00 0.11 0.24 0.68 -26.83
## compressio_ratio 0.01 0.24 15.78 -31.53 -812.12
## horsepower 6.01 0.68 -31.53 1590.39 2982.54
## peak_rpm -43.43 -26.83 -812.12 2982.54 236744.98
## highway_mpg -1.01 0.00 7.25 -209.93 -180.60
## price 1138.46 211.68 2354.57 231850.01 -392664.55
## highway_mpg price
## wheel_base -22.56 27862.47
## length -59.87 67150.39
## width -10.00 12163.95
## height -1.81 2840.08
## curb_weight -2859.42 3358912.33
## engine_size -194.28 281913.60
## bore -1.01 1138.46
## stroke 0.00 211.68
## compressio_ratio 7.25 2354.57
## horsepower -209.93 231850.01
## peak_rpm -180.60 -392664.55
## highway_mpg 47.42 -37747.13
## price -37747.13 62651214.04
As expected, the variances differ greatly across variables because of the different units of measurement. The variable with the highest variance is the price.
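The correlation matrix computed next is just the covariance matrix rescaled by the standard deviations, \(R = D^{-1/2} S D^{-1/2}\); in R, `cov2cor()` performs exactly this rescaling. A toy check with two made-up columns:

```r
# Toy check: rescaling a covariance matrix by the standard deviations
# (cov2cor) reproduces the correlation matrix exactly.
X <- cbind(size  = c(141, 166, 173, 183, 208),
           price = c(5118, 7775, 10295, 16500, 45400))
S_toy <- cov(X)
all.equal(cov2cor(S_toy), cor(X))   # TRUE
```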
# Correlation matrix
R = cor(data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")], use="complete.obs")
round(R, 2)

## wheel_base length width height curb_weight engine_size bore
## wheel_base 1.00 0.87 0.80 0.59 0.78 0.57 0.49
## length 0.87 1.00 0.84 0.49 0.88 0.68 0.60
## width 0.80 0.84 1.00 0.28 0.87 0.74 0.54
## height 0.59 0.49 0.28 1.00 0.30 0.07 0.23
## curb_weight 0.78 0.88 0.87 0.30 1.00 0.85 0.63
## engine_size 0.57 0.68 0.74 0.07 0.85 1.00 0.61
## bore 0.49 0.60 0.54 0.23 0.63 0.61 1.00
## stroke 0.18 0.14 0.18 0.01 0.17 0.24 0.02
## compressio_ratio 0.25 0.16 0.18 0.26 0.15 0.03 0.01
## horsepower 0.35 0.54 0.63 -0.10 0.75 0.80 0.54
## peak_rpm -0.34 -0.29 -0.22 -0.31 -0.26 -0.24 -0.32
## highway_mpg -0.54 -0.70 -0.68 -0.11 -0.80 -0.68 -0.52
## price 0.58 0.69 0.72 0.15 0.82 0.86 0.51
## stroke compressio_ratio horsepower peak_rpm highway_mpg price
## wheel_base 0.18 0.25 0.35 -0.34 -0.54 0.58
## length 0.14 0.16 0.54 -0.29 -0.70 0.69
## width 0.18 0.18 0.63 -0.22 -0.68 0.72
## height 0.01 0.26 -0.10 -0.31 -0.11 0.15
## curb_weight 0.17 0.15 0.75 -0.26 -0.80 0.82
## engine_size 0.24 0.03 0.80 -0.24 -0.68 0.86
## bore 0.02 0.01 0.54 -0.32 -0.52 0.51
## stroke 1.00 0.19 0.05 -0.17 0.00 0.08
## compressio_ratio 0.19 1.00 -0.20 -0.42 0.27 0.07
## horsepower 0.05 -0.20 1.00 0.15 -0.76 0.73
## peak_rpm -0.17 -0.42 0.15 1.00 -0.05 -0.10
## highway_mpg 0.00 0.27 -0.76 -0.05 1.00 -0.69
## price 0.08 0.07 0.73 -0.10 -0.69 1.00
corrplot(R, order="hclust")

The correlation matrix shows a positive linear relation between the variables that describe the dimensions of the car and those related to the power of the engine. In addition, both groups are positively correlated with the price. Therefore, we can say that the most expensive cars will be the bigger ones with powerful engines. On the other hand, the bigger and more expensive the car, the fewer miles per gallon it achieves (it consumes more fuel).
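The `order = "hclust"` argument used above groups variables by hierarchical clustering on a correlation-derived distance. A toy sketch of the idea on a made-up 3×3 correlation matrix:

```r
# Toy sketch of corrplot's "hclust" ordering: cluster variables on a
# distance derived from the correlations, then read off the leaf order.
R_toy <- matrix(c( 1.0,  0.9, -0.6,
                   0.9,  1.0, -0.5,
                  -0.6, -0.5,  1.0),
                nrow = 3,
                dimnames = list(c("length", "width", "mpg"),
                                c("length", "width", "mpg")))
hc <- hclust(as.dist(1 - R_toy))
rownames(R_toy)[hc$order]   # "length" and "width" end up adjacent
```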
A scatterplot matrix of every numeric variable is shown in order to carry out a first graphical analysis.
ggpairs(data,
columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"),
upper = list(continuous = wrap("cor", size = 1.5)),
lower = list(continuous = wrap("points", size = 0.5))) +
theme(strip.text.x = element_text(size = 4),
strip.text.y = element_text(size = 2),
axis.text.x = element_text(size = 2),
axis.text.y = element_text(size = 2))

In the plot, it can be seen that the length and the peak_rpm are the only variables that are not heavily skewed. All the other variables are right-skewed (the mean is larger than the median), except for the height, the bore and the stroke, which are left-skewed (the mean is smaller than the median).
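The visual skewness judgement can be backed with a number: the third standardized moment. A hand-rolled sketch (using the sample standard deviation), on made-up values:

```r
# Hand-rolled skewness (third standardized moment): positive values indicate
# right skew (mean > median), negative values left skew (mean < median).
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew(c(5, 6, 7, 8, 45))     # positive: right-skewed, like price
skew(c(5, 44, 45, 46, 47))  # negative: left-skewed, like height here
```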
Moreover, a correlation analysis can be carried out with this plot:

- Positive correlation between the wheel_base, the length, the width, the curb_weight and the engine_size: vehicles with a large wheel base will be the bigger ones, as they will also be longer, wider and heavier.
- Positive correlation between the engine_size, the horsepower and the price: the bigger cars usually have the most powerful engines, and therefore their prices are also higher.
- The price has a positive correlation with the width, the curb_weight, the engine_size and the horsepower, but a negative correlation with the highway_mpg: the bigger cars are the ones that consume the most fuel on the highway.
Fuel_type

The main characteristics for each group are calculated in order to get a first idea of the differences these groups may have.
We have to take into account that there is a big difference in the sample sizes of the groups: there are only 20 diesel cars, while there are 185 gas cars. Therefore, the conclusions may not be reliable.
# Mean vector
m_fuel = aggregate(x = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")],
by = list(data$`fuel_type`),
FUN = mean)
m_fuel

After comparing both mean vectors, we can say that gas-fueled cars are usually smaller but more powerful and more environmentally friendly. However, the engine size does not seem different at first sight; we would have to carry out a hypothesis test to check the existence of this difference. Moreover, the gas cars are cheaper on average, which may be a consequence of their smaller size.
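The suggested hypothesis test could be a Welch two-sample t-test of engine size between the two fuel types. A sketch on made-up numbers, not the project data:

```r
# Toy sketch of a Welch two-sample t-test on engine size by fuel type;
# a large p-value means no evidence of a difference in group means.
engine <- c(110, 122, 97, 141, 120, 130,   # "gas" group (made up)
            119, 145, 132, 121)            # "diesel" group (made up)
fuel <- rep(c("gas", "diesel"), c(6, 4))
t.test(engine ~ fuel)$p.value
```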
# Covariance matrix
data_num = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")]
data_gas = data_num[data$`fuel_type`=="gas",]
data_diesel = data_num[data$`fuel_type`=="diesel",]
(S_diesel = cov(data_diesel))

## wheel_base length width height
## wheel_base 47.1893684 76.5011579 13.7350526 8.40089474
## length 76.5011579 131.4598947 23.5165263 14.73647368
## width 13.7350526 23.5165263 5.0685263 2.31100000
## height 8.4008947 14.7364737 2.3110000 2.69397368
## curb_weight 3821.7978947 6316.3231579 1261.6957895 674.39157895
## engine_size 188.4178947 303.8400000 66.3094737 28.79000000
## bore 1.4819421 2.3142684 0.4227000 0.26399211
## stroke 0.5268474 0.8078895 0.1820368 0.08627105
## compressio_ratio -3.9831053 -6.1011053 -1.0485789 -0.78071053
## horsepower 154.3257895 254.9057895 56.3489474 24.38868421
## peak_rpm -1108.5789474 -1632.4736842 -272.4210526 -188.13157895
## highway_mpg -53.3868421 -87.5184211 -17.2184211 -8.52763158
## price 41364.9278947 67938.1405263 16108.7478947 5258.03131579
## curb_weight engine_size bore stroke
## wheel_base 3821.79789 188.417895 1.48194211 0.52684737
## length 6316.32316 303.840000 2.31426842 0.80788947
## width 1261.69579 66.309474 0.42270000 0.18203684
## height 674.39158 28.790000 0.26399211 0.08627105
## curb_weight 342676.80000 17824.831579 129.72042105 49.33852632
## engine_size 17824.83158 1031.378947 7.30352632 2.97278947
## bore 129.72042 7.303526 0.08141553 0.02076342
## stroke 49.33853 2.972789 0.02076342 0.01191026
## compressio_ratio -342.49895 -19.287895 -0.20886579 -0.06197632
## horsepower 14554.20000 797.878947 4.95313158 2.20581579
## peak_rpm -91711.05263 -4667.894737 -60.56052632 -15.54473684
## highway_mpg -4662.00000 -237.078947 -1.68513158 -0.57381579
## price 3990844.92632 224512.205263 1173.80507895 690.81807895
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base -3.983105e+00 154.325789 -1108.57895 -5.338684e+01
## length -6.101105e+00 254.905789 -1632.47368 -8.751842e+01
## width -1.048579e+00 56.348947 -272.42105 -1.721842e+01
## height -7.807105e-01 24.388684 -188.13158 -8.527632e+00
## curb_weight -3.424989e+02 14554.200000 -91711.05263 -4.662000e+03
## engine_size -1.928789e+01 797.878947 -4667.89474 -2.370789e+02
## bore -2.088658e-01 4.953132 -60.56053 -1.685132e+00
## stroke -6.197632e-02 2.205816 -15.54474 -5.738158e-01
## compressio_ratio 6.447105e-01 -13.244474 159.55263 4.153947e+00
## horsepower -1.324447e+01 673.839474 -3303.42105 -2.025658e+02
## peak_rpm 1.595526e+02 -3303.421053 57394.73684 1.227632e+03
## highway_mpg 4.153947e+00 -202.565789 1227.63158 7.440789e+01
## price -2.933590e+03 188064.771053 -756516.05263 -5.394343e+04
## price
## wheel_base 41364.9279
## length 67938.1405
## width 16108.7479
## height 5258.0313
## curb_weight 3990844.9263
## engine_size 224512.2053
## bore 1173.8051
## stroke 690.8181
## compressio_ratio -2933.5903
## horsepower 188064.7711
## peak_rpm -756516.0526
## highway_mpg -53943.4342
## price 60215174.4500
(S_gas = cov(data_gas))

## wheel_base length width height
## wheel_base 31.5080511 58.7365755 8.936954e+00 7.31662250
## length 58.7365755 147.5453596 2.079283e+01 12.86681081
## width 8.9369536 20.7928305 4.299642e+00 0.99715188
## height 7.3166225 12.8668108 9.971519e-01 5.80531551
## curb_weight 2071.3590041 5269.8623090 8.804974e+02 260.05584606
## engine_size 132.8623942 349.4370065 6.437960e+01 2.36779377
## bore 0.7335082 2.0135600 3.088380e-01 0.13221202
## stroke 0.1645449 0.3075881 7.604377e-02 -0.06227529
## compressio_ratio -1.0137117 -2.1374933 -3.554993e-01 -0.12345073
## horsepower 91.5720917 288.8049060 5.769899e+01 -8.94572562
## peak_rpm -544.0339307 -1105.3376910 -1.032645e+02 -212.53437133
## highway_mpg -22.2152526 -61.1724471 -1.004702e+01 -2.13863396
## price 24752.8842685 64796.0629994 1.131831e+04 1906.61326381
## curb_weight engine_size bore stroke
## wheel_base 2071.35900 132.862394 7.335082e-01 0.164544888
## length 5269.86231 349.437006 2.013560e+00 0.307588102
## width 880.49741 64.379598 3.088380e-01 0.076043772
## height 260.05585 2.367794 1.322120e-01 -0.062275294
## curb_weight 251001.24971 18243.596063 8.732101e+01 17.382958284
## engine_size 18243.59606 1806.791598 7.113787e+00 3.105043184
## bore 87.32101 7.113787 7.893871e-02 -0.002003211
## stroke 17.38296 3.105043 -2.003211e-03 0.108816816
## compressio_ratio -107.97679 -5.258311 -4.244221e-02 -0.072542092
## horsepower 16446.23414 1414.220035 6.283735e+00 1.089693008
## peak_rpm -35320.67421 -4237.414806 -3.736937e+01 -8.290085194
## highway_mpg -2854.13646 -195.149177 -9.762920e-01 -0.058898649
## price 3198176.68346 286459.443155 1.122707e+03 83.138801410
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base -1.013712e+00 91.572092 -5.440339e+02 -2.221525e+01
## length -2.137493e+00 288.804906 -1.105338e+03 -6.117245e+01
## width -3.554993e-01 57.698986 -1.032645e+02 -1.004702e+01
## height -1.234507e-01 -8.945726 -2.125344e+02 -2.138634e+00
## curb_weight -1.079768e+02 16446.234136 -3.532067e+04 -2.854136e+03
## engine_size -5.258311e+00 1414.220035 -4.237415e+03 -1.951492e+02
## bore -4.244221e-02 6.283735 -3.736937e+01 -9.762920e-01
## stroke -7.254209e-02 1.089693 -8.290085e+00 -5.889865e-02
## compressio_ratio 4.764337e-01 -6.044762 5.592778e+01 1.900237e+00
## horsepower -6.044762e+00 1648.915100 2.067017e+03 -2.025480e+02
## peak_rpm 5.592778e+01 2067.016745 2.007171e+05 9.341951e-01
## highway_mpg 1.900237e+00 -202.548032 9.341951e-01 4.296839e+01
## price -1.017411e+03 244019.086222 -1.316141e+05 -3.760490e+04
## price
## wheel_base 24752.8843
## length 64796.0630
## width 11318.3139
## height 1906.6133
## curb_weight 3198176.6835
## engine_size 286459.4432
## bore 1122.7068
## stroke 83.1388
## compressio_ratio -1017.4113
## horsepower 244019.0862
## peak_rpm -131614.0937
## highway_mpg -37604.8994
## price 62331611.9750
# Correlation matrix
(R_diesel = cor(data_diesel))

## wheel_base length width height curb_weight
## wheel_base 1.0000000 0.9712908 0.8881113 0.7450867 0.9503931
## length 0.9712908 1.0000000 0.9110363 0.7830694 0.9410774
## width 0.8881113 0.9110363 1.0000000 0.6254065 0.9573516
## height 0.7450867 0.7830694 0.6254065 1.0000000 0.7018966
## curb_weight 0.9503931 0.9410774 0.9573516 0.7018966 1.0000000
## engine_size 0.8540650 0.8251619 0.9171188 0.5461801 0.9481432
## bore 0.7560587 0.7073974 0.6580180 0.5636904 0.7766268
## stroke 0.7027519 0.6456462 0.7408969 0.4816236 0.7722949
## compressio_ratio -0.7221332 -0.6627191 -0.5800667 -0.5923944 -0.7286761
## horsepower 0.8654425 0.8564556 0.9641991 0.5724183 0.9577844
## peak_rpm -0.6736100 -0.5943107 -0.5050842 -0.4784417 -0.6539481
## highway_mpg -0.9009540 -0.8848982 -0.8866311 -0.6023130 -0.9232523
## price 0.7759919 0.7635967 0.9220786 0.4128322 0.8785562
## engine_size bore stroke compressio_ratio horsepower
## wheel_base 0.8540650 0.7560587 0.7027519 -0.7221332 0.8654425
## length 0.8251619 0.7073974 0.6456462 -0.6627191 0.8564556
## width 0.9171188 0.6580180 0.7408969 -0.5800667 0.9641991
## height 0.5461801 0.5636904 0.4816236 -0.5923944 0.5724183
## curb_weight 0.9481432 0.7766268 0.7722949 -0.7286761 0.9577844
## engine_size 1.0000000 0.7970212 0.8481923 -0.7479857 0.9570831
## bore 0.7970212 1.0000000 0.6667831 -0.9116567 0.6687257
## stroke 0.8481923 0.6667831 1.0000000 -0.7072659 0.7786286
## compressio_ratio -0.7479857 -0.9116567 -0.7072659 1.0000000 -0.6354393
## horsepower 0.9570831 0.6687257 0.7786286 -0.6354393 1.0000000
## peak_rpm -0.6067026 -0.8859312 -0.5945476 0.8294411 -0.5311897
## highway_mpg -0.8558041 -0.6846533 -0.6095401 0.5997484 -0.9046454
## price 0.9009026 0.5301386 0.8157368 -0.4708300 0.9336331
## peak_rpm highway_mpg price
## wheel_base -0.6736100 -0.9009540 0.7759919
## length -0.5943107 -0.8848982 0.7635967
## width -0.5050842 -0.8866311 0.9220786
## height -0.4784417 -0.6023130 0.4128322
## curb_weight -0.6539481 -0.9232523 0.8785562
## engine_size -0.6067026 -0.8558041 0.9009026
## bore -0.8859312 -0.6846533 0.5301386
## stroke -0.5945476 -0.6095401 0.8157368
## compressio_ratio 0.8294411 0.5997484 -0.4708300
## horsepower -0.5311897 -0.9046454 0.9336331
## peak_rpm 1.0000000 0.5940493 -0.4069388
## highway_mpg 0.5940493 1.0000000 -0.8058906
## price -0.4069388 -0.8058906 1.0000000
(R_gas = cor(data_gas))

## wheel_base length width height curb_weight
## wheel_base 1.00000000 0.8614599 0.7678256 0.54098717 0.7365575
## length 0.86145992 1.0000000 0.8255335 0.43963800 0.8659614
## width 0.76782563 0.8255335 1.0000000 0.19958701 0.8475670
## height 0.54098717 0.4396380 0.1995870 1.00000000 0.2154348
## curb_weight 0.73655751 0.8659614 0.8475670 0.21543475 1.0000000
## engine_size 0.55684878 0.6767871 0.7304292 0.02311942 0.8566797
## bore 0.46510306 0.5900065 0.5301144 0.19530484 0.6203486
## stroke 0.08886395 0.0767642 0.1111730 -0.07835286 0.1051811
## compressio_ratio -0.26163899 -0.2549416 -0.2483827 -0.07423001 -0.3122419
## horsepower 0.40174753 0.5855212 0.6852561 -0.09143311 0.8084057
## peak_rpm -0.21633313 -0.2031141 -0.1111586 -0.19689019 -0.1573616
## highway_mpg -0.60376224 -0.7682782 -0.7391737 -0.13540946 -0.8690850
## price 0.55854846 0.6756654 0.6913714 0.10022949 0.8085564
## engine_size bore stroke compressio_ratio
## wheel_base 0.55684878 0.46510306 0.08886395 -0.26163899
## length 0.67678712 0.59000649 0.07676420 -0.25494163
## width 0.73042915 0.53011437 0.11117305 -0.24838270
## height 0.02311942 0.19530484 -0.07835286 -0.07423001
## curb_weight 0.85667970 0.62034857 0.10518111 -0.31224194
## engine_size 1.00000000 0.59566464 0.22144486 -0.17922186
## bore 0.59566464 1.00000000 -0.02161393 -0.21885263
## stroke 0.22144486 -0.02161393 1.00000000 -0.31859638
## compressio_ratio -0.17922186 -0.21885263 -0.31859638 1.00000000
## horsepower 0.81933934 0.55077438 0.08134990 -0.21566458
## peak_rpm -0.22251268 -0.29687842 -0.05609431 0.18085651
## highway_mpg -0.70038712 -0.53010291 -0.02723847 0.41998329
## price 0.85360103 0.50613592 0.03192284 -0.18669863
## horsepower peak_rpm highway_mpg price
## wheel_base 0.40174753 -0.2163331291 -0.6037622350 0.55854846
## length 0.58552122 -0.2031140742 -0.7682782445 0.67566537
## width 0.68525610 -0.1111585594 -0.7391737241 0.69137143
## height -0.09143311 -0.1968901893 -0.1354094592 0.10022949
## curb_weight 0.80840572 -0.1573616348 -0.8690849912 0.80855642
## engine_size 0.81933934 -0.2225126800 -0.7003871228 0.85360103
## bore 0.55077438 -0.2968784243 -0.5301029099 0.50613592
## stroke 0.08134990 -0.0560943091 -0.0272384746 0.03192284
## compressio_ratio -0.21566458 0.1808565064 0.4199832950 -0.18669863
## horsepower 1.00000000 0.1136194449 -0.7609468761 0.76115038
## peak_rpm 0.11361944 1.0000000000 0.0003181053 -0.03720969
## highway_mpg -0.76094688 0.0003181053 1.0000000000 -0.72663399
## price 0.76115038 -0.0372096885 -0.7266339923 1.00000000
In order to perform a better analysis, the fuel type will be taken into consideration.
ggpairs(data,
aes(color = `fuel_type`),
columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"),
upper = list(continuous = wrap("cor", size = 1.5)),
lower = list(continuous = wrap("points", size = 0.5))) +
theme(strip.text.x = element_text(size = 4),
strip.text.y = element_text(size = 2),
axis.text.x = element_text(size = 2),
axis.text.y = element_text(size = 2))

In this case, the plot shows that the compression ratio distinguishes the cars' fuel type almost perfectly: vehicles with a smaller compression ratio are powered by gas, while those with a higher compression ratio are fueled by diesel.
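This separation can be checked numerically with a single-threshold rule (hypothetical cutoff of 12, toy values echoing the plot):

```r
# Toy check: a single compression-ratio threshold separates the fuel types.
cr <- c(9.0, 8.6, 9.4, 7.0, 10.0,          # gas-like values (made up)
        21.5, 22.7, 23.0, 21.9)            # diesel-like values (made up)
fuel <- rep(c("gas", "diesel"), c(5, 4))
pred <- ifelse(cr > 12, "diesel", "gas")
mean(pred == fuel)                          # 1: perfect separation here
```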
In addition, the wheel_base, the width, the height, the curb_weight and the highway_mpg help to differentiate the fuel type, as the larger cars are usually powered by diesel.
The peak revolutions per minute (peak_rpm) variable is also a good indicator, as diesel engines usually cannot rev as high.
Symboling

The symboling indicates the safety rating of the car (\(-2\) being the value of the safest cars and \(3\) that of the most dangerous), hence we consider it a good variable to analyze.
# Mean vector
m_symb = aggregate(x = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")],
by = list(data$symboling),
FUN = mean)
m_symb

In this case, we can see that the safest cars have the largest mean dimensions, but the cars with the highest mean horsepower are the most dangerous ones. Moreover, the highest mean price corresponds to the most dangerous cars rather than the safest ones.
Before analyzing the variable, we gather the cars into three groups, as some of the original levels have very few instances:
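The regrouping can equivalently be expressed with `cut()`. A toy sketch (in this report symboling is stored as a factor, so it would first need `as.numeric(as.character(...))`):

```r
# Toy sketch: collapse the six symboling levels into three risk groups.
symb <- c(-2, -1, 0, 1, 2, 3)
group <- cut(symb, breaks = c(-Inf, -1, 1, Inf),
             labels = c("safe", "neutral", "risky"))
table(group)   # two cars in each toy group
```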
# Covariance matrix
data_num = data[,c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price")]
data_r = data_num[(data$symboling==3) | (data$symboling==2),]
data_n = data_num[(data$symboling==0) | (data$symboling==1),]
data_s = data_num[(data$symboling==-1) | (data$symboling==-2),]
(S_r = cov(data_r))

## wheel_base length width height
## wheel_base 1.412119e+01 28.9635447 2.7504296 4.0731911
## length 2.896354e+01 102.0819345 10.7541818 6.4423700
## width 2.750430e+00 10.7541818 2.1563238 -0.2477031
## height 4.073191e+00 6.4423700 -0.2477031 5.8428755
## curb_weight 5.081797e+02 2879.3997954 472.7517534 -137.7914085
## engine_size 6.724956e+00 152.9789012 31.4948860 -18.6793688
## bore 5.196786e-02 0.9502057 0.1043992 -0.0278647
## stroke 3.242247e-01 0.4026204 0.1264705 -0.1042119
## compressio_ratio 1.040162e+00 -0.3035552 -0.3027162 2.8473025
## horsepower 1.558913e+00 182.4067504 32.0248977 -25.6024839
## peak_rpm -2.186470e+02 -249.0385739 -14.6201052 -139.6902396
## highway_mpg -7.706458e+00 -44.0803916 -6.7057569 4.9316189
## price -2.489644e+03 25546.0286967 4974.5707773 -2589.1207773
## curb_weight engine_size bore stroke
## wheel_base 508.17972 6.724956 0.05196786 3.242247e-01
## length 2879.39980 152.978901 0.95020573 4.026204e-01
## width 472.75175 31.494886 0.10439918 1.264705e-01
## height -137.79141 -18.679369 -0.02786470 -1.042119e-01
## curb_weight 151385.57802 11244.914670 61.91029807 1.959516e+01
## engine_size 11244.91467 1252.139684 7.33921683 3.651873e+00
## bore 61.91030 7.339217 0.09677165 2.732481e-02
## stroke 19.59516 3.651873 0.02732481 1.547635e-01
## compressio_ratio -264.60721 -29.159591 -0.28705228 -8.323846e-03
## horsepower 12280.42519 1126.902104 6.29861192 8.676505e-02
## peak_rpm -11294.30158 -2531.706604 -43.41817650 -5.781575e+01
## highway_mpg -2251.57773 -145.517534 -0.87630625 2.467855e-01
## price 1925998.35476 185967.537113 910.53158387 -4.924068e+02
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base 1.040162e+00 1.558913e+00 -218.64699 -7.706458e+00
## length -3.035552e-01 1.824068e+02 -249.03857 -4.408039e+01
## width -3.027162e-01 3.202490e+01 -14.62011 -6.705757e+00
## height 2.847302e+00 -2.560248e+01 -139.69024 4.931619e+00
## curb_weight -2.646072e+02 1.228043e+04 -11294.30158 -2.251578e+03
## engine_size -2.915959e+01 1.126902e+03 -2531.70660 -1.455175e+02
## bore -2.870523e-01 6.298612e+00 -43.41818 -8.763063e-01
## stroke -8.323846e-03 8.676505e-02 -57.81575 2.467855e-01
## compressio_ratio 1.030311e+01 -4.590623e+01 -298.99591 1.232608e+01
## horsepower -4.590623e+01 1.620538e+03 5602.46932 -2.151762e+02
## peak_rpm -2.989959e+02 5.602469e+03 192441.55465 -8.846581e+02
## highway_mpg 1.232608e+01 -2.151762e+02 -884.65809 5.389947e+01
## price -3.663823e+03 2.358503e+05 1011449.85389 -3.092265e+04
## price
## wheel_base -2489.6436
## length 25546.0287
## width 4974.5708
## height -2589.1208
## curb_weight 1925998.3548
## engine_size 185967.5371
## bore 910.5316
## stroke -492.4068
## compressio_ratio -3663.8232
## horsepower 235850.2583
## peak_rpm 1011449.8539
## highway_mpg -30922.6470
## price 49733363.7884
(S_n = cov(data_n))

## wheel_base length width height
## wheel_base 35.8007328 68.5658099 11.4907741 7.02717355
## length 68.5658099 162.4269215 25.0266494 14.25629270
## width 11.4907741 25.0266494 5.3023829 1.56865702
## height 7.0271736 14.2562927 1.5686570 4.74166391
## curb_weight 2904.4136433 6416.9902548 1107.3962328 451.01820937
## engine_size 190.5608884 434.8481749 79.9675895 10.59264463
## bore 1.0757191 2.4300220 0.4018499 0.17898567
## stroke 0.2922473 0.6374983 0.1225238 0.01332107
## compressio_ratio 4.9689562 7.0893218 1.1866580 1.78077252
## horsepower 125.0676860 306.9158196 66.7429063 -5.12024105
## peak_rpm -1130.2396694 -2393.5192837 -273.2227961 -404.54579890
## highway_mpg -28.0149587 -66.3832438 -11.3938912 -4.09968320
## price 37270.5190083 80896.4136846 14023.4792218 4041.52739669
## curb_weight engine_size bore stroke
## wheel_base 2904.41364 190.560888 1.075719e+00 0.292247314
## length 6416.99025 434.848175 2.430022e+00 0.637498347
## width 1107.39623 79.967590 4.018499e-01 0.122523760
## height 451.01821 10.592645 1.789857e-01 0.013321074
## curb_weight 310501.95275 21949.710675 1.032828e+02 34.410865014
## engine_size 21949.71067 2092.027824 7.531585e+00 3.646768595
## bore 103.28277 7.531585 6.863551e-02 -0.008493747
## stroke 34.41087 3.646769 -8.493747e-03 0.093927948
## compressio_ratio 330.91031 8.709130 1.622938e-01 0.278833134
## horsepower 17368.82665 1554.848072 5.911997e+00 1.396302342
## peak_rpm -91600.60262 -6300.199725 -5.718815e+01 2.894111570
## highway_mpg -3132.12741 -225.162466 -1.057603e+00 -0.136590220
## price 3793972.91618 334286.632507 1.205517e+03 557.756831956
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base 4.9689562 125.067686 -1.130240e+03 -2.801496e+01
## length 7.0893218 306.915820 -2.393519e+03 -6.638324e+01
## width 1.1866580 66.742906 -2.732228e+02 -1.139389e+01
## height 1.7807725 -5.120241 -4.045458e+02 -4.099683e+00
## curb_weight 330.9103051 17368.826653 -9.160060e+04 -3.132127e+03
## engine_size 8.7091302 1554.848072 -6.300200e+03 -2.251625e+02
## bore 0.1622938 5.911997 -5.718815e+01 -1.057603e+00
## stroke 0.2788331 1.396302 2.894112e+00 -1.365902e-01
## compressio_ratio 15.4488844 -27.199467 -9.223371e+02 7.127618e+00
## horsepower -27.1994669 1647.697658 1.403805e+03 -2.126931e+02
## peak_rpm -922.3370523 1403.805096 2.487855e+05 1.621419e+02
## highway_mpg 7.1276185 -212.693113 1.621419e+02 4.735523e+01
## price 1257.3287404 238751.093044 -9.798452e+05 -4.056428e+04
## price
## wheel_base 37270.5190
## length 80896.4137
## width 14023.4792
## height 4041.5274
## curb_weight 3793972.9162
## engine_size 334286.6325
## bore 1205.5165
## stroke 557.7568
## compressio_ratio 1257.3287
## horsepower 238751.0930
## peak_rpm -979845.1550
## highway_mpg -40564.2756
## price 67651109.4315
(S_s = cov(data_s))
## wheel_base length width height
## wheel_base 19.1317333 31.1023333 7.713533e+00 2.44600000
## length 31.1023333 69.6958333 1.358700e+01 3.33083333
## width 7.7135333 13.5870000 3.689767e+00 0.25383333
## height 2.4460000 3.3308333 2.538333e-01 3.49250000
## curb_weight 1602.4173333 3395.0525000 7.078562e+02 245.48083333
## engine_size 104.5795000 199.8533333 4.473017e+01 8.99666667
## bore 0.3467750 0.9960833 1.600333e-01 0.15883333
## stroke -0.1217033 -0.6197833 -3.381833e-02 -0.02125833
## compressio_ratio 10.0421667 11.9750000 5.140667e+00 1.16833333
## horsepower 43.5953333 144.3200000 1.993600e+01 -2.82833333
## peak_rpm -206.2166667 802.4166667 -1.449000e+02 43.00000000
## highway_mpg -11.2403333 -27.0200000 -4.641833e+00 -1.36333333
## price 28782.2700000 53710.8333333 1.304846e+04 2580.45000000
## curb_weight engine_size bore stroke
## wheel_base 1602.41733 1.045795e+02 0.34677500 -0.12170333
## length 3395.05250 1.998533e+02 0.99608333 -0.61978333
## width 707.85617 4.473017e+01 0.16003333 -0.03381833
## height 245.48083 8.996667e+00 0.15883333 -0.02125833
## curb_weight 190111.75667 1.091884e+04 37.24691667 -11.20995000
## engine_size 10918.83667 8.397100e+02 1.28908333 -0.71976667
## bore 37.24692 1.289083e+00 0.05435833 -0.02667417
## stroke -11.20995 -7.197667e-01 -0.02667417 0.04489933
## compressio_ratio 1043.53417 4.222333e+01 -0.23037500 0.52755833
## horsepower 7194.87000 4.135183e+02 1.88241667 -2.51660000
## peak_rpm 13485.75000 -5.685833e+02 51.28333333 -80.14333333
## highway_mpg -1410.61167 -8.670167e+01 -0.31033333 0.39426667
## price 2939786.07500 1.810252e+05 496.28000000 -175.54641667
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base 10.0421667 43.595333 -206.21667 -1.124033e+01
## length 11.9750000 144.320000 802.41667 -2.702000e+01
## width 5.1406667 19.936000 -144.90000 -4.641833e+00
## height 1.1683333 -2.828333 43.00000 -1.363333e+00
## curb_weight 1043.5341667 7194.870000 13485.75000 -1.410612e+03
## engine_size 42.2233333 413.518333 -568.58333 -8.670167e+01
## bore -0.2303750 1.882417 51.28333 -3.103333e-01
## stroke 0.5275583 -2.516600 -80.14333 3.942667e-01
## compressio_ratio 29.6000000 -31.651667 -1183.08333 3.183333e-01
## horsepower -31.6516667 706.993333 5210.16667 -9.259333e+01
## peak_rpm -1183.0833333 5210.166667 237516.66667 -6.805833e+02
## highway_mpg 0.3183333 -92.593333 -680.58333 1.569333e+01
## price 17480.6125000 101885.350000 -227227.50000 -2.232431e+04
## price
## wheel_base 28782.2700
## length 53710.8333
## width 13048.4558
## height 2580.4500
## curb_weight 2939786.0750
## engine_size 181025.2250
## bore 496.2800
## stroke -175.5464
## compressio_ratio 17480.6125
## horsepower 101885.3500
## peak_rpm -227227.5000
## highway_mpg -22324.3083
## price 51973072.7500
# Correlation matrix
(R_r = cor(data_r))
## wheel_base length width height curb_weight
## wheel_base 1.00000000 0.76285443 0.49843433 0.44842093 0.3475678
## length 0.76285443 1.00000000 0.72484632 0.26378974 0.7324623
## width 0.49843433 0.72484632 1.00000000 -0.06978485 0.8274350
## height 0.44842093 0.26378974 -0.06978485 1.00000000 -0.1465098
## curb_weight 0.34756776 0.73246229 0.82743504 -0.14650976 1.0000000
## engine_size 0.05057402 0.42788859 0.60611728 -0.21838484 0.8167470
## bore 0.04445546 0.30232156 0.22854200 -0.03705672 0.5115011
## stroke 0.21931898 0.10129477 0.21892627 -0.10958965 0.1280184
## compressio_ratio 0.08623455 -0.00936007 -0.06422359 0.36697469 -0.2118728
## horsepower 0.01030521 0.44847361 0.54175284 -0.26311113 0.7840461
## peak_rpm -0.13263510 -0.05618789 -0.02269573 -0.13173565 -0.0661710
## highway_mpg -0.27933604 -0.59426306 -0.62201170 0.27789705 -0.7882290
## price -0.09394587 0.35852961 0.48036871 -0.15188507 0.7019234
## engine_size bore stroke compressio_ratio
## wheel_base 0.05057402 0.04445546 0.219318981 0.086234548
## length 0.42788859 0.30232156 0.101294768 -0.009360070
## width 0.60611728 0.22854200 0.218926273 -0.064223588
## height -0.21838484 -0.03705672 -0.109589649 0.366974687
## curb_weight 0.81674695 0.51150109 0.128018362 -0.211872818
## engine_size 1.00000000 0.66672885 0.262334243 -0.256726594
## bore 0.66672885 1.00000000 0.223279451 -0.287476760
## stroke 0.26233424 0.22327945 1.000000000 -0.006591823
## compressio_ratio -0.25672659 -0.28747676 -0.006591823 1.000000000
## horsepower 0.79109790 0.50296898 0.005478743 -0.355269490
## peak_rpm -0.16309384 -0.31816183 -0.335013594 -0.212339916
## highway_mpg -0.56014011 -0.38369803 0.085446263 0.523056478
## price 0.74522400 0.41504675 -0.177486684 -0.161855109
## horsepower peak_rpm highway_mpg price
## wheel_base 0.010305206 -0.13263510 -0.27933604 -0.09394587
## length 0.448473609 -0.05618789 -0.59426306 0.35852961
## width 0.541752838 -0.02269573 -0.62201170 0.48036871
## height -0.263111128 -0.13173565 0.27789705 -0.15188507
## curb_weight 0.784046071 -0.06617100 -0.78822904 0.70192344
## engine_size 0.791097898 -0.16309384 -0.56014011 0.74522400
## bore 0.502968978 -0.31816183 -0.38369803 0.41504675
## stroke 0.005478743 -0.33501359 0.08544626 -0.17748668
## compressio_ratio -0.355269490 -0.21233992 0.52305648 -0.16185511
## horsepower 1.000000000 0.31724917 -0.72806881 0.83077395
## peak_rpm 0.317249166 1.00000000 -0.27468425 0.32694222
## highway_mpg -0.728068810 -0.27468425 1.00000000 -0.59725595
## price 0.830773950 0.32694222 -0.59725595 1.00000000
(R_n = cor(data_n))
## wheel_base length width height curb_weight
## wheel_base 1.0000000 0.8991510 0.8340033 0.53934833 0.8711247
## length 0.8991510 1.0000000 0.8527823 0.51370313 0.9035875
## width 0.8340033 0.8527823 1.0000000 0.31284345 0.8630488
## height 0.5393483 0.5137031 0.3128435 1.00000000 0.3717034
## curb_weight 0.8711247 0.9035875 0.8630488 0.37170339 1.0000000
## engine_size 0.6963123 0.7459761 0.7592674 0.10635436 0.8612176
## bore 0.6862439 0.7277912 0.6661221 0.31374619 0.7074916
## stroke 0.1593702 0.1632122 0.1736150 0.01996073 0.2014958
## compressio_ratio 0.2112860 0.1415230 0.1311117 0.20806283 0.1510879
## horsepower 0.5149445 0.5932688 0.7140536 -0.05792770 0.7678910
## peak_rpm -0.3787146 -0.3765264 -0.2378860 -0.37246872 -0.3295745
## highway_mpg -0.6803933 -0.7569121 -0.7190391 -0.27359044 -0.8168142
## price 0.7573245 0.7717254 0.7404279 0.22565382 0.8277984
## engine_size bore stroke compressio_ratio
## wheel_base 0.69631233 0.6862439 0.15937021 0.21128596
## length 0.74597611 0.7277912 0.16321221 0.14152303
## width 0.75926736 0.6661221 0.17361504 0.13111170
## height 0.10635436 0.3137462 0.01996073 0.20806283
## curb_weight 0.86121761 0.7074916 0.20149580 0.15108786
## engine_size 1.00000000 0.6285332 0.26015194 0.04844428
## bore 0.62853323 1.0000000 -0.10578582 0.15760822
## stroke 0.26015194 -0.1057858 1.00000000 0.23147197
## compressio_ratio 0.04844428 0.1576082 0.23147197 1.00000000
## horsepower 0.83746269 0.5559311 0.11223887 -0.17047986
## peak_rpm -0.27615829 -0.4376422 0.01893239 -0.47046637
## highway_mpg -0.71536596 -0.5866298 -0.06476468 0.26351897
## price 0.88858263 0.5594500 0.22126362 0.03889223
## horsepower peak_rpm highway_mpg price
## wheel_base 0.51494446 -0.37871463 -0.68039331 0.75732452
## length 0.59326879 -0.37652639 -0.75691214 0.77172541
## width 0.71405359 -0.23788600 -0.71903914 0.74042789
## height -0.05792770 -0.37246872 -0.27359044 0.22565382
## curb_weight 0.76789095 -0.32957454 -0.81681416 0.82779843
## engine_size 0.83746269 -0.27615829 -0.71536596 0.88858263
## bore 0.55593111 -0.43764215 -0.58662976 0.55945001
## stroke 0.11223887 0.01893239 -0.06476468 0.22126362
## compressio_ratio -0.17047986 -0.47046637 0.26351897 0.03889223
## horsepower 1.00000000 0.06933548 -0.76143120 0.71510390
## peak_rpm 0.06933548 1.00000000 0.04723878 -0.23884023
## highway_mpg -0.76143120 0.04723878 1.00000000 -0.71667502
## price 0.71510390 -0.23884023 -0.71667502 1.00000000
(R_s = cor(data_s))
## wheel_base length width height curb_weight
## wheel_base 1.00000000 0.8517505 0.91807202 0.29923410 0.84022188
## length 0.85175051 1.0000000 0.84726783 0.21349184 0.93269275
## width 0.91807202 0.8472678 1.00000000 0.07071003 0.84516418
## height 0.29923410 0.2134918 0.07071003 1.00000000 0.30126237
## curb_weight 0.84022188 0.9326927 0.84516418 0.30126237 1.00000000
## engine_size 0.82509649 0.8261204 0.80359348 0.16613027 0.86418599
## bore 0.34004591 0.5117516 0.35733696 0.36453596 0.36639773
## stroke -0.13131229 -0.3503616 -0.08308695 -0.05368353 -0.12133309
## compressio_ratio 0.42199219 0.2636491 0.49189696 0.11490865 0.43990240
## horsepower 0.37484799 0.6501528 0.39032909 -0.05691867 0.62059862
## peak_rpm -0.09673855 0.1972193 -0.15478251 0.04721209 0.06346343
## highway_mpg -0.64870080 -0.8170043 -0.61000373 -0.18415192 -0.81666771
## price 0.91276439 0.8924200 0.94225938 0.19153054 0.93523778
## engine_size bore stroke compressio_ratio
## wheel_base 0.8250965 0.3400459 -0.13131229 0.42199219
## length 0.8261204 0.5117516 -0.35036163 0.26364908
## width 0.8035935 0.3573370 -0.08308695 0.49189696
## height 0.1661303 0.3645360 -0.05368353 0.11490865
## curb_weight 0.8641860 0.3663977 -0.12133309 0.43990240
## engine_size 1.0000000 0.1908022 -0.11722146 0.26781938
## bore 0.1908022 1.0000000 -0.53993048 -0.18161697
## stroke -0.1172215 -0.5399305 1.00000000 0.45761967
## compressio_ratio 0.2678194 -0.1816170 0.45761967 1.00000000
## horsepower 0.5366887 0.3036512 -0.44666980 -0.21879786
## peak_rpm -0.0402608 0.4513320 -0.77606913 -0.44619267
## highway_mpg -0.7552751 -0.3359988 0.46969130 0.01476994
## price 0.8665333 0.2952599 -0.11491662 0.44567841
## horsepower peak_rpm highway_mpg price
## wheel_base 0.37484799 -0.09673855 -0.64870080 0.91276439
## length 0.65015282 0.19721933 -0.81700429 0.89241998
## width 0.39032909 -0.15478251 -0.61000373 0.94225938
## height -0.05691867 0.04721209 -0.18415192 0.19153054
## curb_weight 0.62059862 0.06346343 -0.81666771 0.93523778
## engine_size 0.53668869 -0.04026080 -0.75527513 0.86653326
## bore 0.30365122 0.45133201 -0.33599883 0.29525993
## stroke -0.44666980 -0.77606913 0.46969130 -0.11491662
## compressio_ratio -0.21879786 -0.44619267 0.01476994 0.44567841
## horsepower 1.00000000 0.40206560 -0.87905178 0.53151422
## peak_rpm 0.40206560 1.00000000 -0.35251424 -0.06467324
## highway_mpg -0.87905178 -0.35251424 1.00000000 -0.78168399
## price 0.53151422 -0.06467324 -0.78168399 1.00000000
We plot the results in order to analyze them more easily.
data$risk <- ifelse(data$symboling==3, 1,
ifelse(data$symboling==2, 1,
ifelse(data$symboling==1, 0,
ifelse(data$symboling==0, 0,
ifelse(data$symboling==-1, -1,
ifelse(data$symboling==-2, -1, 0))))))
data$risk <- as.factor(data$risk)
ggpairs(data,
aes(color = risk),
columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"),
upper = list(continuous = wrap("cor", size = 1.5)),
lower = list(continuous = wrap("points", size = 0.5))) +
theme(strip.text.x = element_text(size = 4),
strip.text.y = element_text(size = 2),
axis.text.x = element_text(size = 2),
        axis.text.y = element_text(size = 2))
In the plot, it can be seen that the least safe cars are usually the smallest, although they are not the cheapest. The safest cars are not the largest either: they tend to sit in the middle of the curb_weight range and to have the largest bores, with a peak_rpm that is neither very high nor very low.
For the most part, though, the risk of a car cannot be predicted easily from this plot.
The values that differ greatly from the rest of the data set are considered outliers. It is important to study these observations and identify them, as they can cause serious problems in several analyses, such as dimension reduction, due to their influence on the mean vector and the covariance and correlation matrices.
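As a toy illustration of this sensitivity (simulated data, not from this dataset), a single extreme point is enough to distort the classical correlation estimate:

```r
set.seed(42)
x <- rnorm(50)
y <- x + rnorm(50, sd = 0.2)
clean <- cbind(x, y)
dirty <- rbind(clean, c(10, -10))  # add one extreme point

cor(clean)[1, 2]  # strong positive correlation
cor(dirty)[1, 2]  # sharply reduced by a single observation
```

This is why robust estimators, which downweight or exclude such points, are preferred for outlier detection.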
In order to determine whether there are outliers in the data set, we estimate the mean vector and the covariance and correlation matrices with the minimum covariance determinant (MCD) estimator, which is based on the Mahalanobis distance:
\[ D_M (x, \mu) = \sqrt{(x - \mu)' \Sigma^{-1} (x - \mu)}\]
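As a sanity check on the formula (a minimal sketch on toy data, not part of the analysis), base R's `mahalanobis()` computes exactly the squared distance D_M^2:

```r
# Toy data: 5 observations, 2 variables
X  <- matrix(c(1, 2, 3, 4, 5,
               2, 1, 4, 3, 6), ncol = 2)
mu <- colMeans(X)
S  <- cov(X)

# mahalanobis() returns the SQUARED distance D_M^2
d2 <- mahalanobis(X, center = mu, cov = S)

# Equivalent manual computation from the formula above
d2_manual <- apply(X, 1, function(x) t(x - mu) %*% solve(S) %*% (x - mu))
stopifnot(isTRUE(all.equal(d2, as.numeric(d2_manual))))
```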
As the fuel_type variable divides our data into two well-differentiated groups, we will search for outliers within each group separately.
MCD_est = CovMcd(data_gas,alpha=0.85,nsamp="deterministic")
# Robust mean vector
m_MCD = MCD_est$center
m_MCD
## wheel_base length width height
## 97.371053 171.132237 65.276974 53.428947
## curb_weight engine_size bore stroke
## 2391.296053 114.355263 3.289079 3.226579
## compressio_ratio horsepower peak_rpm highway_mpg
## 8.882368 96.388158 5214.802632 31.427632
## price
## 10435.282895
# Robust covariance matrix
S_MCD = MCD_est$cov
S_MCD
## wheel_base length width height
## wheel_base 20.4371612 41.7490398 6.08824230 6.44120179
## length 41.7490398 125.6007817 15.20309419 12.56314189
## width 6.0882423 15.2030942 2.69775184 1.11452216
## height 6.4412018 12.5631419 1.11452216 6.49132532
## curb_weight 1433.1743038 4087.1628149 562.37705338 292.85149197
## engine_size 71.6425389 211.7800854 29.33238001 10.27578643
## bore 0.7409726 2.1719858 0.26498173 0.22792689
## stroke 0.1715724 0.3798969 0.08985876 -0.04554487
## compressio_ratio -0.4527135 -1.3745719 -0.27644553 0.08943117
## horsepower 78.5574931 238.4507530 34.89745904 3.76964385
## peak_rpm -282.8869775 -853.9645042 -81.30783436 -195.60660522
## highway_mpg -14.9533718 -47.2197036 -6.71061681 -2.04151058
## price 13686.4232673 37712.3770323 5513.22762220 2256.79250451
## curb_weight engine_size bore stroke
## wheel_base 1433.17430 71.642539 0.740972613 0.171572449
## length 4087.16281 211.780085 2.171985760 0.379896889
## width 562.37705 29.332380 0.264981732 0.089858761
## height 292.85149 10.275786 0.227926891 -0.045544872
## curb_weight 174925.57094 9408.243172 82.489153156 18.909528354
## engine_size 9408.24317 701.130429 5.171584698 3.256370332
## bore 82.48915 5.171585 0.083140528 -0.007573647
## stroke 18.90953 3.256370 -0.007573647 0.105715169
## compressio_ratio -102.62361 -5.832309 -0.056810960 -0.073680239
## horsepower 11344.79830 646.290662 4.788890677 1.670302151
## peak_rpm -20457.90968 -3401.691299 -59.089098384 1.661626115
## highway_mpg -2161.36730 -104.083691 -0.900746622 -0.079903321
## price 1581059.27487 76573.003428 694.548032619 49.767389563
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base -0.45271348 78.557493 -282.886978 -1.495337e+01
## length -1.37457186 238.450753 -853.964504 -4.721970e+01
## width -0.27644553 34.897459 -81.307834 -6.710617e+00
## height 0.08943117 3.769644 -195.606605 -2.041511e+00
## curb_weight -102.62360575 11344.798302 -20457.909675 -2.161367e+03
## engine_size -5.83230872 646.290662 -3401.691299 -1.040837e+02
## bore -0.05681096 4.788891 -59.089098 -9.007466e-01
## stroke -0.07368024 1.670302 1.661626 -7.990332e-02
## compressio_ratio 0.52194240 -9.358664 63.667176 1.989785e+00
## horsepower -9.35866355 999.421160 1499.055711 -1.568303e+02
## peak_rpm 63.66717571 1499.055711 227560.012260 -1.517017e+02
## highway_mpg 1.98978490 -156.830294 -151.701675 3.700268e+01
## price -806.22226055 111652.275889 178774.782682 -2.085700e+04
## price
## wheel_base 1.368642e+04
## length 3.771238e+04
## width 5.513228e+03
## height 2.256793e+03
## curb_weight 1.581059e+06
## engine_size 7.657300e+04
## bore 6.945480e+02
## stroke 4.976739e+01
## compressio_ratio -8.062223e+02
## horsepower 1.116523e+05
## peak_rpm 1.787748e+05
## highway_mpg -2.085700e+04
## price 1.874260e+07
# Robust correlation matrix
R_MCD = cov2cor(S_MCD)
R_MCD
## wheel_base length width height curb_weight
## wheel_base 1.0000000 0.8240243 0.8199373 0.55922932 0.7579876
## length 0.8240243 1.0000000 0.8259140 0.43998250 0.8719651
## width 0.8199373 0.8259140 1.0000000 0.26633057 0.8186532
## height 0.5592293 0.4399825 0.2663306 1.00000000 0.2748237
## curb_weight 0.7579876 0.8719651 0.8186532 0.27482365 1.0000000
## engine_size 0.5984960 0.7136570 0.6744456 0.15231721 0.8495373
## bore 0.5684411 0.6721318 0.5595109 0.31025764 0.6840121
## stroke 0.1167263 0.1042560 0.1682638 -0.05497992 0.1390546
## compressio_ratio -0.1386123 -0.1697696 -0.2329685 0.04858598 -0.3396326
## horsepower 0.5496713 0.6730204 0.6720762 0.04680146 0.8580169
## peak_rpm -0.1311761 -0.1597334 -0.1037728 -0.16094180 -0.1025384
## highway_mpg -0.5437662 -0.6926446 -0.6716528 -0.13172504 -0.8495429
## price 0.6993020 0.7772716 0.7753361 0.20460207 0.8731857
## engine_size bore stroke compressio_ratio
## wheel_base 0.5984960 0.56844111 0.11672627 -0.13861233
## length 0.7136570 0.67213175 0.10425596 -0.16976963
## width 0.6744456 0.55951089 0.16826383 -0.23296855
## height 0.1523172 0.31025764 -0.05497992 0.04858598
## curb_weight 0.8495373 0.68401213 0.13905461 -0.33963257
## engine_size 1.0000000 0.67735734 0.37823849 -0.30488061
## bore 0.6773573 1.00000000 -0.08078486 -0.27271841
## stroke 0.3782385 -0.08078486 1.00000000 -0.31366863
## compressio_ratio -0.3048806 -0.27271841 -0.31366863 1.00000000
## horsepower 0.7720656 0.52535668 0.16249953 -0.40975879
## peak_rpm -0.2693069 -0.42958885 0.01071314 0.18473804
## highway_mpg -0.6461996 -0.51354635 -0.04039983 0.45277063
## price 0.6679768 0.55639214 0.03535585 -0.25776775
## horsepower peak_rpm highway_mpg price
## wheel_base 0.54967130 -0.13117609 -0.54376625 0.69930200
## length 0.67302044 -0.15973338 -0.69264465 0.77727165
## width 0.67207623 -0.10377277 -0.67165282 0.77533611
## height 0.04680146 -0.16094180 -0.13172504 0.20460207
## curb_weight 0.85801693 -0.10253842 -0.84954292 0.87318565
## engine_size 0.77206560 -0.26930692 -0.64619959 0.66797678
## bore 0.52535668 -0.42958885 -0.51354635 0.55639214
## stroke 0.16249953 0.01071314 -0.04039983 0.03535585
## compressio_ratio -0.40975879 0.18473804 0.45277063 -0.25776775
## horsepower 1.00000000 0.09940209 -0.81552839 0.81579004
## peak_rpm 0.09940209 1.00000000 -0.05227882 0.08656521
## highway_mpg -0.81552839 -0.05227882 1.00000000 -0.79199127
## price 0.81579004 0.08656521 -0.79199127 1.00000000
In the graph shown below, we can see that the first eigenvalue of the covariance matrix computed from the raw data is larger than the one obtained with the MCD estimate; the remaining eigenvalues are similar.
# Compare eigenvalues of both covariance matrices
eval_S = eigen(S)$values
eval_S_MCD = eigen(S_MCD)$values
min_y = min(cbind(eval_S,eval_S_MCD)) - 1
max_y = max(cbind(eval_S,eval_S_MCD)) + 1
plot(1:13,eval_S,col="blue",type="b",xlab="Number",ylab="Eigenvalues",pch=19,
ylim=c(min_y,max_y),main="Comparison of eigenvalues")
points(1:13,eval_S_MCD,col="red",type="b",pch=19)
legend(7,60000000, legend=c("Eigenvalues of S","Eigenvalues of S MCD"),
       col=c("Blue","Red"),lty=1,cex=1.2)
A comparison between the correlation matrices can also help to assess the influence of the outliers.
# Compare eigenvalues of both correlation matrices
par(mfrow=c(1,2))
corrplot(R)
corrplot(R_MCD)
In this case, there is not a big difference between the correlations computed from the raw data and those obtained through the MCD estimation, although some have decreased, such as the correlation between horsepower and wheel_base and the one between bore and width.
A graphical analysis has been carried out in order to detect the outliers. In these graphs we represent the squared Mahalanobis distances, with the outliers shown in green.
# Show the squared Mahalanobis distances with the final set of non_outliers
n = nrow(data_gas)
p = ncol(data_gas)
X_sq_Mah_MCD = MCD_est$mah
col_outliers_Mah_MCD = rep("black",n)
outliers_Mah_MCD = which(X_sq_Mah_MCD>qchisq(.99^(1/n),p))
outliers_Mah_MCD
## [1] 10 13 14 15 16 17 18 31 48 49 50 66 67 68 69 98 99 102 104
## [20] 114 115 116 117 118 123 144 184
col_outliers_Mah_MCD[outliers_Mah_MCD] = "green"
par(mfrow=c(1,2))
plot(1:n,X_sq_Mah_MCD,pch=19,col=col_outliers_Mah_MCD,
main="Squared Mahalanobis distances",xlab="Observation",ylab="Squared Mahalanobis distance")
abline(h=qchisq(.99^(1/n),p),lwd=3,col="blue")
plot(1:n,log(X_sq_Mah_MCD),pch=19,col=col_outliers_Mah_MCD,
main="Log of squared Mahalanobis distances",
xlab="Observation",ylab="Log of squared Mahalanobis distance")
abline(h=log(qchisq(.99^(1/n),p)),lwd=3,col="blue")
# Show potential outliers
ggpairs(data_gas, mapping = aes(color = col_outliers_Mah_MCD),
columns = c("wheel_base", "length", "width", "height", "curb_weight", "engine_size", "bore", "stroke", "compressio_ratio", "horsepower", "peak_rpm", "highway_mpg", "price"),
upper = list(continuous = wrap("cor", size = 1.5)),
lower = list(continuous = wrap("points", size = 0.5))) +
theme(strip.text.x = element_text(size = 4),
strip.text.y = element_text(size = 2),
axis.text.x = element_text(size = 2),
        axis.text.y = element_text(size = 2))
After representing the outliers in the scatterplot matrix, we can see that the cars flagged as outliers are the larger and heavier ones, and especially those with the highest fuel consumption.
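The threshold `qchisq(.99^(1/n),p)` used to flag outliers can be read as a Šidák-style correction: each of the n squared distances is compared against a per-observation quantile chosen so that, under multivariate normality, the probability that no observation is flagged by chance is 0.99 for the whole sample. A quick numeric check (using n = 185 and p = 13, the dimensions of the gas group):

```r
n <- 185; p <- 13            # sample size and dimension of the gas group
conf_point <- 0.99^(1 / n)   # per-observation confidence level
cutoff <- qchisq(conf_point, df = p)

# If each D^2 stays below the cutoff independently with probability
# conf_point, then P(no point exceeds it) = conf_point^n = 0.99
stopifnot(isTRUE(all.equal(conf_point^n, 0.99)))

# The adjusted cutoff is stricter (larger) than the unadjusted 99% quantile
stopifnot(cutoff > qchisq(0.99, df = p))
```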
MCD_est = CovMcd(data_diesel,alpha=0.85,nsamp="deterministic")
# Robust mean vector
m_MCD = MCD_est$center
m_MCD
## wheel_base length width height
## 104.910526 182.889474 67.621053 55.905263
## curb_weight engine_size bore stroke
## 2945.210526 137.421053 3.394737 3.486316
## compressio_ratio horsepower peak_rpm highway_mpg
## 22.010526 86.000000 4415.789474 33.947368
## price
## 16298.105263
# Robust covariance matrix
S_MCD = MCD_est$cov
S_MCD
## wheel_base length width height
## wheel_base 86.795121 139.937557 24.4073146 15.9363531
## length 139.937557 241.017526 41.8823564 28.0529800
## width 24.407315 41.882356 9.0150167 4.2441723
## height 15.936353 28.052980 4.2441723 5.3926631
## curb_weight 6937.389803 11428.471180 2253.1987327 1265.8051658
## engine_size 354.285208 568.448552 124.1202378 54.7975104
## bore 2.641575 4.069196 0.7176557 0.4894598
## stroke 1.077094 1.648904 0.3717788 0.1768813
## compressio_ratio -8.396076 -12.879525 -2.2227294 -1.6383823
## horsepower 287.145148 473.019575 104.7812944 46.1514521
## peak_rpm -1887.961621 -2689.999951 -413.5384227 -334.8253887
## highway_mpg -93.570371 -152.519670 -29.4055778 -15.3759508
## price 76002.744832 124358.217679 29820.5592894 9582.6370055
## curb_weight engine_size bore stroke
## wheel_base 6937.3898 354.285208 2.64157489 1.07709362
## length 11428.4712 568.448552 4.06919621 1.64890424
## width 2253.1987 124.120238 0.71765566 0.37177878
## height 1265.8052 54.797510 0.48945981 0.17688128
## curb_weight 622537.5009 33718.919720 230.51568180 100.94752351
## engine_size 33718.9197 2020.015327 13.72864677 6.11883757
## bore 230.5157 13.728647 0.15215474 0.04245841
## stroke 100.9475 6.118838 0.04245841 0.02472079
## compressio_ratio -722.3404 -40.473753 -0.43865177 -0.12896850
## horsepower 27252.7605 1547.057917 8.98859679 4.53093521
## peak_rpm -153515.1671 -8325.450269 -109.68308740 -31.64925906
## highway_mpg -8139.1464 -435.221201 -2.82653181 -1.16512558
## price 7405479.7778 433618.172372 2052.23483068 1419.86335211
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base -8.3960762 287.145148 -1.887962e+03 -93.570371
## length -12.8795245 473.019575 -2.690000e+03 -152.519670
## width -2.2227294 104.781294 -4.135384e+02 -29.405578
## height -1.6383823 46.151452 -3.348254e+02 -15.375951
## curb_weight -722.3404477 27252.760532 -1.535152e+05 -8139.146447
## engine_size -40.4737528 1547.057917 -8.325450e+03 -435.221201
## bore -0.4386518 8.988597 -1.096831e+02 -2.826532
## stroke -0.1289685 4.530935 -3.164926e+01 -1.165126
## compressio_ratio 1.3383806 -27.876789 3.359478e+02 8.815860
## horsepower -27.8767894 1300.333566 -5.626759e+03 -369.211616
## peak_rpm 335.9478049 -5626.758788 1.039242e+05 1910.122146
## highway_mpg 8.8158598 -369.211616 1.910122e+03 127.840328
## price -6201.3685412 361153.376098 -1.204764e+06 -96747.321187
## price
## wheel_base 76002.745
## length 124358.218
## width 29820.559
## height 9582.637
## curb_weight 7405479.778
## engine_size 433618.172
## bore 2052.235
## stroke 1419.863
## compressio_ratio -6201.369
## horsepower 361153.376
## peak_rpm -1204764.143
## highway_mpg -96747.321
## price 116329651.227
# Robust correlation matrix
R_MCD = cov2cor(S_MCD)
R_MCD
## wheel_base length width height curb_weight
## wheel_base 1.0000000 0.9675253 0.8725475 0.7366134 0.9437689
## length 0.9675253 1.0000000 0.8985113 0.7781316 0.9329988
## width 0.8725475 0.8985113 1.0000000 0.6087067 0.9511160
## height 0.7366134 0.7781316 0.6087067 1.0000000 0.6908479
## curb_weight 0.9437689 0.9329988 0.9511160 0.6908479 1.0000000
## engine_size 0.8461123 0.8146846 0.9197758 0.5250273 0.9508534
## bore 0.7268964 0.6719571 0.6127596 0.5403469 0.7489882
## stroke 0.7353171 0.6755228 0.7875356 0.4844502 0.8137327
## compressio_ratio -0.7790026 -0.7171099 -0.6399021 -0.6098511 -0.7913514
## horsepower 0.8547251 0.8449430 0.9677721 0.5511329 0.9578563
## peak_rpm -0.6286184 -0.5374890 -0.4272422 -0.4472582 -0.6035457
## highway_mpg -0.8882941 -0.8688956 -0.8661887 -0.5856072 -0.9123514
## price 0.7563737 0.7426855 0.9208466 0.3825940 0.8702123
## engine_size bore stroke compressio_ratio horsepower
## wheel_base 0.8461123 0.7268964 0.7353171 -0.7790026 0.8547251
## length 0.8146846 0.6719571 0.6755228 -0.7171099 0.8449430
## width 0.9197758 0.6127596 0.7875356 -0.6399021 0.9677721
## height 0.5250273 0.5403469 0.4844502 -0.6098511 0.5511329
## curb_weight 0.9508534 0.7489882 0.8137327 -0.7913514 0.9578563
## engine_size 1.0000000 0.7830827 0.8658855 -0.7784063 0.9545570
## bore 0.7830827 1.0000000 0.6922929 -0.9720471 0.6390308
## stroke 0.8658855 0.6922929 1.0000000 -0.7090269 0.7991519
## compressio_ratio -0.7784063 -0.9720471 -0.7090269 1.0000000 -0.6682294
## horsepower 0.9545570 0.6390308 0.7991519 -0.6682294 1.0000000
## peak_rpm -0.5746084 -0.8722451 -0.6244160 0.9007901 -0.4840303
## highway_mpg -0.8564436 -0.6408807 -0.6554019 0.6739706 -0.9055533
## price 0.8945094 0.4877970 0.8372790 -0.4969956 0.9285795
## peak_rpm highway_mpg price
## wheel_base -0.6286184 -0.8882941 0.7563737
## length -0.5374890 -0.8688956 0.7426855
## width -0.4272422 -0.8661887 0.9208466
## height -0.4472582 -0.5856072 0.3825940
## curb_weight -0.6035457 -0.9123514 0.8702123
## engine_size -0.5746084 -0.8564436 0.8945094
## bore -0.8722451 -0.6408807 0.4877970
## stroke -0.6244160 -0.6554019 0.8372790
## compressio_ratio 0.9007901 0.6739706 -0.4969956
## horsepower -0.4840303 -0.9055533 0.9285795
## peak_rpm 1.0000000 0.5240453 -0.3464962
## highway_mpg 0.5240453 1.0000000 -0.7933408
## price -0.3464962 -0.7933408 1.0000000
In the graph shown below, we can see that the first eigenvalue of the covariance matrix computed from the raw data is smaller than the one obtained with the MCD estimate; the remaining eigenvalues are similar.
# Compare eigenvalues of both covariance matrices
eval_S = eigen(S)$values
eval_S_MCD = eigen(S_MCD)$values
min_y = min(cbind(eval_S,eval_S_MCD)) - 1
max_y = max(cbind(eval_S,eval_S_MCD)) + 1
plot(1:13,eval_S,col="blue",type="b",xlab="Number",ylab="Eigenvalues",pch=19,
ylim=c(min_y,max_y),main="Comparison of eigenvalues")
points(1:13,eval_S_MCD,col="red",type="b",pch=19)
legend(7,60000000, legend=c("Eigenvalues of S","Eigenvalues of S MCD"),
       col=c("Blue","Red"),lty=1,cex=1.2)
A comparison between the correlation matrices can also help to assess the influence of the outliers.
# Compare eigenvalues of both correlation matrices
par(mfrow=c(1,2))
corrplot(R)
corrplot(R_MCD)
In this case, there is a big difference between the correlations computed from the raw data and those obtained through the MCD estimation. However, this comparison is not very reliable: there are only 20 diesel cars and 13 variables, so the number of observations is too small to draw firm conclusions.
A graphical analysis has been carried out in order to detect the outliers. In these graphs we represent the squared Mahalanobis distances, with the outliers shown in green.
# Show the squared Mahalanobis distances with the final set of non_outliers
n = nrow(data_diesel)
p = ncol(data_diesel)
X_sq_Mah_MCD = MCD_est$mah
col_outliers_Mah_MCD = rep("black",n)
outliers_Mah_MCD = which(X_sq_Mah_MCD>qchisq(.99^(1/n),p))
outliers_Mah_MCD
## [1] 7
col_outliers_Mah_MCD[outliers_Mah_MCD] = "green"
par(mfrow=c(1,2))
plot(1:n,X_sq_Mah_MCD,pch=19,col=col_outliers_Mah_MCD,
main="Squared Mahalanobis distances",xlab="Observation",ylab="Squared Mahalanobis distance")
abline(h=qchisq(.99^(1/n),p),lwd=3,col="blue")
plot(1:n,log(X_sq_Mah_MCD),pch=19,col=col_outliers_Mah_MCD,
main="Log of squared Mahalanobis distances",
xlab="Observation",ylab="Log of squared Mahalanobis distance")
abline(h=log(qchisq(.99^(1/n),p)),lwd=3,col="blue")
In this plot, we can see that there might be one outlier among the diesel cars.
As we saw in the previous section, most of the variables were highly skewed, so we apply some transformations to make them more symmetric, as Principal Component Analysis assumes normality of the data.
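Whether a transformation actually reduces skewness can be checked by comparing the sample skewness before and after. A hand-rolled sketch (no extra packages, shown on simulated right-skewed data rather than on this dataset):

```r
# Sample skewness: mean of cubed standardized values
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)

set.seed(1)
x <- rlnorm(500)  # strongly right-skewed (log-normal)

# A log transform should bring the skewness close to zero
c(raw = skewness(x), logged = skewness(log(x)))
stopifnot(abs(skewness(log(x))) < abs(skewness(x)))
```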
data_pca <- data_num
# wheel_base, length, curb_weight and highway_mpg are left unchanged
data_pca$width <- sqrt(data_num$width)
data_pca$height <- sqrt(data_num$height)
data_pca$engine_size <- log(data_num$engine_size)
data_pca$bore <- (data_num$bore)^2
data_pca$stroke <- (data_num$stroke)^2
data_pca$compressio_ratio <- log(data_num$compressio_ratio)
data_pca$horsepower <- log(data_num$horsepower)
data_pca$peak_rpm <- log(data_num$peak_rpm)
data_pca$price <- log(data_num$price)
We calculate the main characteristics of the transformed data.
(m = colMeans(data_pca, na.rm = T))
## wheel_base length width height
## 98.756585 174.049268 8.117307 7.327839
## curb_weight engine_size bore stroke
## 2555.565854 4.800184 11.096593 10.624848
## compressio_ratio horsepower peak_rpm highway_mpg
## 2.267389 4.577552 8.535553 30.751220
## price
## 9.340520
(S_pca <- cov(data_pca))
## wheel_base length width height
## wheel_base 36.2617824 64.9751887 0.628432373 0.590580132
## length 64.9751887 152.2086882 1.364495651 1.006668225
## width 0.6284324 1.3644957 0.017220635 0.006068115
## height 0.5905801 1.0066682 0.006068115 0.027786064
## curb_weight 2434.2967456 5638.3362004 59.281172679 25.340444097
## engine_size 1.0118760 2.5548097 0.028084803 0.005574286
## bore 5.5355847 13.7646896 0.134505851 0.069384137
## stroke 2.3826002 4.0467819 0.053054757 0.001364226
## compressio_ratio 0.3823485 0.4627282 0.005695758 0.012029935
## horsepower 0.8724088 2.5907102 0.029989803 -0.003967362
## peak_rpm -0.2026168 -0.3439416 -0.002835638 -0.004950943
## highway_mpg -22.5623242 -59.8680751 -0.613519821 -0.119513877
## price 1.9164296 4.7736147 0.049820214 0.015966360
## curb_weight engine_size bore stroke
## wheel_base 2434.29675 1.011875998 5.535584702 2.382600223
## length 5638.33620 2.554809705 13.764689603 4.046781898
## width 59.28117 0.028084803 0.134505851 0.053054757
## height 25.34044 0.005574286 0.069384137 0.001364226
## curb_weight 271107.87432 128.809554691 620.468585256 210.266293895
## engine_size 128.80955 0.080069713 0.360089264 0.174499282
## bore 620.46859 0.360089264 3.487034348 0.073272214
## stroke 210.26629 0.174499282 0.073272214 4.200952379
## compressio_ratio 17.52156 0.001615863 -0.002199066 0.082792260
## horsepower 140.30672 0.078871207 0.369786629 0.051282575
## peak_rpm -13.12866 -0.007481764 -0.055027454 -0.038049459
## highway_mpg -2859.41736 -1.369197015 -6.766536511 -0.334482166
## price 226.46503 0.116871838 0.528439881 0.109801255
## compressio_ratio horsepower peak_rpm highway_mpg
## wheel_base 0.382348516 0.872408843 -0.202616816 -2.256232e+01
## length 0.462728218 2.590710176 -0.343941560 -5.986808e+01
## width 0.005695758 0.029989803 -0.002835638 -6.135198e-01
## height 0.012029935 -0.003967362 -0.004950943 -1.195139e-01
## curb_weight 17.521555536 140.306720084 -13.128664344 -2.859417e+03
## engine_size 0.001615863 0.078871207 -0.007481764 -1.369197e+00
## bore -0.002199066 0.369786629 -0.055027454 -6.766537e+00
## stroke 0.082792260 0.051282575 -0.038049459 -3.344822e-01
## compressio_ratio 0.079656005 -0.025304521 -0.011303940 5.789827e-01
## horsepower -0.025304521 0.123294450 0.005519330 -2.016952e+00
## peak_rpm -0.011303940 0.005519330 0.009203295 -3.338861e-02
## highway_mpg 0.578982748 -2.016951545 -0.033388613 4.742310e+01
## price 0.009837654 0.137505926 -0.005619887 -2.649051e+00
## price
## wheel_base 1.916429581
## length 4.773614667
## width 0.049820214
## height 0.015966360
## curb_weight 226.465034900
## engine_size 0.116871838
## bore 0.528439881
## stroke 0.109801255
## compressio_ratio 0.009837654
## horsepower 0.137505926
## peak_rpm -0.005619887
## highway_mpg -2.649050854
## price 0.252050249
(R_pca <- cor(data_pca))
## wheel_base length width height curb_weight
## wheel_base 1.0000000 0.8745875 0.7952605 0.588356757 0.7763863
## length 0.8745875 1.0000000 0.8428065 0.489500484 0.8777285
## width 0.7952605 0.8428065 1.0000000 0.277405927 0.8676032
## height 0.5883568 0.4895005 0.2774059 1.000000000 0.2919642
## curb_weight 0.7763863 0.8777285 0.8676032 0.291964226 1.0000000
## engine_size 0.5938388 0.7318207 0.7563323 0.118179340 0.8742646
## bore 0.4922784 0.5974734 0.5488940 0.222904360 0.6381468
## stroke 0.1930424 0.1600355 0.1972540 0.003992996 0.1970265
## compressio_ratio 0.2249705 0.1328914 0.1537863 0.255705750 0.1192319
## horsepower 0.4125948 0.5980360 0.6508445 -0.067782347 0.7674245
## peak_rpm -0.3507351 -0.2905984 -0.2252447 -0.309601287 -0.2628317
## highway_mpg -0.5440819 -0.7046616 -0.6789051 -0.104114173 -0.7974648
## price 0.6339058 0.7706977 0.7562014 0.190787030 0.8663363
## engine_size bore stroke compressio_ratio
## wheel_base 0.59383882 0.492278438 0.193042407 0.224970501
## length 0.73182072 0.597473418 0.160035451 0.132891438
## width 0.75633235 0.548894003 0.197254026 0.153786326
## height 0.11817934 0.222904360 0.003992996 0.255705750
## curb_weight 0.87426457 0.638146834 0.197026533 0.119231864
## engine_size 1.00000000 0.681471858 0.300874672 0.020233044
## bore 0.68147186 1.000000000 0.019144205 -0.004172545
## stroke 0.30087467 0.019144205 1.000000000 0.143122067
## compressio_ratio 0.02023304 -0.004172545 0.143122067 1.000000000
## horsepower 0.79380273 0.563963928 0.071256450 -0.255338964
## peak_rpm -0.27561204 -0.307170596 -0.193509791 -0.417492715
## highway_mpg -0.70264643 -0.526190922 -0.023697587 0.297893754
## price 0.82268234 0.563668685 0.106706251 0.069428650
## horsepower peak_rpm highway_mpg price
## wheel_base 0.41259477 -0.35073511 -0.54408192 0.63390580
## length 0.59803602 -0.29059843 -0.70466160 0.77069772
## width 0.65084454 -0.22524466 -0.67890508 0.75620135
## height -0.06778235 -0.30960129 -0.10411417 0.19078703
## curb_weight 0.76742447 -0.26283174 -0.79746479 0.86633633
## engine_size 0.79380273 -0.27561204 -0.70264643 0.82268234
## bore 0.56396393 -0.30717060 -0.52619092 0.56366869
## stroke 0.07125645 -0.19350979 -0.02369759 0.10670625
## compressio_ratio -0.25533896 -0.41749271 0.29789375 0.06942865
## horsepower 1.00000000 0.16384867 -0.83412039 0.78002062
## peak_rpm 0.16384867 1.00000000 -0.05053959 -0.11668428
## highway_mpg -0.83412039 -0.05053959 1.00000000 -0.76621697
## price 0.78002062 -0.11668428 -0.76621697 1.00000000
We can see that the scales of the variances are very different, so we compute the PCA from the correlation matrix.
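As a quick sanity check of this choice (a sketch on the built-in mtcars data, not this report's dataset), PCA with `scale = TRUE` is exactly the eigendecomposition of the correlation matrix:

```r
# Sketch on the built-in mtcars data (illustration only): PCA with
# scale = TRUE is the eigendecomposition of the correlation matrix R.
X <- mtcars[, c("mpg", "disp", "hp", "wt")]
p_cor <- prcomp(X, scale = TRUE)   # PCA on standardized variables
eig <- eigen(cor(X))               # eigendecomposition of R
# The component variances are exactly the eigenvalues of R
all.equal(p_cor$sdev^2, eig$values)
```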
p1 <- prcomp(data_pca, scale = TRUE)
# Explained variability
fviz_eig(p1, addlabels = TRUE, ylim = c(0, 60))
The first three components explain 77.9% of the variance of the original data, so we only represent these three components graphically.
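The percentage shown for each component is simply the cumulative proportion of the squared standard deviations returned by prcomp; a minimal sketch of the computation (on the built-in mtcars data, for illustration only):

```r
# Sketch: the percentages reported by fviz_eig come directly from the
# PC standard deviations (built-in mtcars data used for illustration).
p <- prcomp(mtcars, scale = TRUE)
prop_var <- p$sdev^2 / sum(p$sdev^2)  # proportion of variance per PC
cumsum(prop_var)                      # cumulative proportion explained
```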
The correlations between the original variables and the first three components are computed in order to interpret the components and their behavior. These correlations will also be shown graphically with the help of biplots.
cor(data_pca, p1$x[,1:3])
## PC1 PC2 PC3
## wheel_base 0.82102846 -0.3545504773 0.18141494
## length 0.92073283 -0.1653848829 0.14575839
## width 0.89670250 -0.0669287444 -0.02436096
## height 0.34509406 -0.5959076626 0.51992260
## curb_weight 0.96848435 0.0003158146 -0.04013278
## engine_size 0.89590331 0.1053649652 -0.26254576
## bore 0.71474702 0.0145072936 0.06503974
## stroke 0.20198529 -0.2307933271 -0.80112039
## compressio_ratio 0.07163726 -0.7508055161 -0.16157392
## horsepower 0.78632884 0.5342303795 -0.07861467
## peak_rpm -0.26932926 0.7036507788 0.16706329
## highway_mpg -0.81851178 -0.4229832350 -0.10574682
## price 0.89061238 0.1432471789 -0.02187952
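For PCA on standardized data, these correlations can also be obtained without going back to the scores: cor(variable j, PC k) equals the loading times the standard deviation of the component. A sketch on the built-in mtcars data (illustration only):

```r
# Sketch: for PCA on standardized data, cor(variable j, PC k) equals
# rotation[j, k] * sdev[k] (built-in mtcars data for illustration).
p <- prcomp(mtcars, scale = TRUE)
manual <- sweep(p$rotation, 2, p$sdev, "*")  # loading * component sd
direct <- cor(mtcars, p$x)                   # correlations computed directly
all.equal(manual, direct)
```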
par(mfrow = c(1,3))
biplot(p1, main="PC1 vs PC2", cex=0.7)
biplot(p1, choices=c(1,3), main="PC1 vs PC3", cex=0.7)
biplot(p1, choices=c(3,2), main="PC3 vs PC2", cex=0.7)
The first component has high positive correlations with the variables wheel_base, length, width, curb_weight, engine_size, bore, horsepower and price, and a negative correlation with highway_mpg. This means that large, powerful, less environmentally friendly cars will be on the right side, and smaller cars with lower consumption will be on the left.
The second component has negative correlations with height and compressio_ratio and a positive correlation with peak_rpm. This means that tall, more efficient cars will be at the bottom of the graph, while shorter, less efficient cars will be at the top (PC1 vs PC2).
The third component has a negative correlation with stroke, which is related to the engine's displacement, so cars with a longer stroke will appear on the left side of the biplot (PC3 vs PC2). Moreover, there is also a positive correlation with the variable height, so taller cars will be displayed on the right side of that biplot.
The representation of the first 3 components is shown below.
color_1 <- "deepskyblue2"
color_2 <- "darkorchid2"
colors_X <- c(color_1,color_2)[1*(data$fuel_type=="gas")+1]
pairs(p1$x[,1:3],col=colors_X,pch=19,main="The first three PCs")
We can see that the \(2^{nd}\) component is the most useful for identifying both groups, with diesel-fueled cars in blue and gas-fueled cars in purple. Therefore, we can say that diesel cars are usually shorter and more efficient, while gas cars are taller and less efficient. This conclusion verifies the assumptions made in the previous sections.
In this case, we transform the numeric data linearly to obtain a new matrix, Z, whose variables are maximally independent and non-Gaussian.
This is achieved by maximizing the negative entropy (neg-entropy).
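A related pre-processing step, which algorithms such as icafast perform internally, is whitening: the centered data are multiplied by S^(-1/2) so that the covariance of the result is the identity. A base-R sketch (built-in mtcars data, illustration only):

```r
# Sketch of the whitening pre-processing step used by ICA algorithms:
# multiply the centered data by S^(-1/2) so that the covariance of the
# result is the identity (built-in mtcars data, illustration only).
X <- scale(mtcars, center = TRUE, scale = FALSE)
e <- eigen(cov(X))
W <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)  # S^(-1/2)
Z_white <- X %*% W
max(abs(cov(Z_white) - diag(ncol(X))))  # essentially zero
```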
# Matrix dimension
n <- nrow(data_pca)
p <- ncol(data_pca)
# ICs after data transformation
X_trans_ica <- ica::icafast(data_pca,nc=p,alg="par")
# The ICA scores
Z <- X_trans_ica$S
colnames(Z) <- sprintf("IC-%d",seq(1,13))
# Re-scale to have S_z=I
Z <- Z * sqrt((n-1)/n)
# Histograms of the ICA scores obtained
par(mar=c(1,1,1,1))
par(mfrow=c(4,4))
invisible(sapply(colnames(Z),function(cname){hist(as.data.frame(Z)[[cname]],
main=cname,col=color_1,xlab="")}))
# Compute the neg-entropy of the columns in Z and sort them in decreasing order of neg-entropy
neg_entropy <- function(z){1/12 * mean(z^3)^2 + 1/48 * mean(z^4)^2}
Z_neg_entropy <- apply(Z,2,neg_entropy)
ic_sort <- sort(Z_neg_entropy,decreasing=TRUE,index.return=TRUE)$ix
ic_sort
## [1] 6 11 13 8 3 7 4 9 2 10 12 5 1
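As a hedged aside, the moment-based approximation is easy to sanity-check. Written with the excess kurtosis term (mean(z^4) - 3, the form usually quoted for this approximation), it is close to zero for Gaussian data and clearly larger for non-Gaussian data:

```r
# Sketch: moment-based neg-entropy approximation, written here with the
# excess kurtosis term (mean(z^4) - 3). It is near zero for Gaussian
# data and larger for clearly non-Gaussian data.
neg_entropy_approx <- function(z) {
  z <- (z - mean(z)) / sd(z)
  1/12 * mean(z^3)^2 + 1/48 * (mean(z^4) - 3)^2
}
set.seed(42)
neg_entropy_approx(rnorm(1e5))  # close to 0
neg_entropy_approx(rexp(1e5))   # clearly larger (skewed, heavy-tailed)
```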
# Plot the sorted neg-entropy and define the ICs sorted by neg-entropy
par(mfrow=c(1,1))
plot(Z_neg_entropy[ic_sort],type="b",col=color_1,pch=19,
ylab="Neg-entropy",main="Neg-entropies",lwd=3)
After plotting the neg-entropies, we can see that four ICs are clearly larger than the rest.
set.seed(10040485)
Z_ic_imp <- Z[,ic_sort]
# Plot the two ICs with largest neg-entropy
par(mfrow=c(1,1))
plot(Z_ic_imp[,1:2],pch=19,col=colors_X,
xlab="First IC scores",ylab="Second IC scores",main="First two IC scores")
# Let's identify the points with large negative values of the first IC
which(Z_ic_imp[,1]<(-1))
## [1] 11 12 13 14 15 17 19 41 66 67 98 99 131 132 155 163 205
# Let's identify the points with large positive values of the second IC
which(Z_ic_imp[,2]>(6))
## [1] 156
The first two scores point out that some gas-fueled cars are outliers.
In order to see which ICs can show the differentiation between the two groups, the scatterplot matrix of the ICs with the highest and lowest neg-entropies is plotted.
# Plot the ICs with the highest and lowest entropy
pairs(Z_ic_imp[,c(1:5,12:13)],col=colors_X,pch=19,
main="ICs with the highest and lowest entropy")
plot(Z_ic_imp[,12:13],pch=19,col=colors_X,
xlab="5th IC scores",ylab="1st IC scores",main="Last two IC scores")
Often, the differentiation between two groups can be seen in the pairs with the lowest neg-entropy, as that is where the minimum values of the kurtosis are attained. However, as both graphs show, we do not obtain a clear separation in this case.
plot(Z_ic_imp[,6:7],pch=19,col=colors_X,
xlab="7th IC scores",ylab="4th IC scores")
Nevertheless, looking at the plots of the ICs, we can see that the seventh IC separates two groups; after checking them, we found that these groups are differentiated by the variable fuel_type.
The correlation between the data and the ICs is calculated:
# Correlation data vs ICs
corrplot(cor(data_pca,Z_ic_imp),is.corr=T)
In this case, IC7 is highly positively correlated with compressio_ratio, which is why this IC is able to differentiate between diesel and gas cars. On the other hand, IC3 is negatively correlated with highway_mpg.
Finally, we check the correlation between the PCA components and the ICs.
# Correlation PCA vs ICs
corrplot(cor(prcomp(data_pca)$x,Z),is.corr=T)
As can be seen, most of the correlations are not very high, which is the most common situation. However, some stand out, for example the negative correlation between IC4 and PC13 and the positive correlation between IC7 and PC3.
r <- 3
data_pca<-scale(data_pca)
p<-ncol(data_pca)
# Initial estimation of M and Sigma_nu
M_0 <- p1$rotation[,1:r] %*% diag(p1$sdev[1:r])
M_0
## [,1] [,2] [,3]
## wheel_base 0.82102846 -0.3545504773 0.18141494
## length 0.92073283 -0.1653848829 0.14575839
## width 0.89670250 -0.0669287444 -0.02436096
## height 0.34509406 -0.5959076626 0.51992260
## curb_weight 0.96848435 0.0003158146 -0.04013278
## engine_size 0.89590331 0.1053649652 -0.26254576
## bore 0.71474702 0.0145072936 0.06503974
## stroke 0.20198529 -0.2307933271 -0.80112039
## compressio_ratio 0.07163726 -0.7508055161 -0.16157392
## horsepower 0.78632884 0.5342303795 -0.07861467
## peak_rpm -0.26932926 0.7036507788 0.16706329
## highway_mpg -0.81851178 -0.4229832350 -0.10574682
## price 0.89061238 0.1432471789 -0.02187952
S_y <- cov(data_pca)
Sigma_nu_0 <- diag(diag(S_y - M_0 %*% t(M_0)))
Sigma_nu_0
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.1672948 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [2,] 0.0000000 0.1036534 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [3,] 0.0000000 0.0000000 0.1908517 0.0000000 0.00000000 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000 0.0000000 0.2554846 0.00000000 0.0000000 0.0000000
## [5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.06042732 0.0000000 0.0000000
## [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.1173252 0.0000000
## [7,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.4846961
## [8,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [9,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [10,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [11,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [12,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [13,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [2,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [3,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [4,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [5,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [6,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [7,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [8,] 0.2641425 0.000000 0.0000000 0.0000000 0.0000000 0.0000000
## [9,] 0.0000000 0.405053 0.0000000 0.0000000 0.0000000 0.0000000
## [10,] 0.0000000 0.000000 0.0901046 0.0000000 0.0000000 0.0000000
## [11,] 0.0000000 0.000000 0.0000000 0.4044272 0.0000000 0.0000000
## [12,] 0.0000000 0.000000 0.0000000 0.0000000 0.1399413 0.0000000
## [13,] 0.0000000 0.000000 0.0000000 0.0000000 0.0000000 0.1858111
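By construction, this split satisfies the factor-model decomposition on the diagonal: for standardized variables, each communality plus its specific variance reconstructs a unit variance. A quick check (built-in mtcars data, illustration only, with an assumed r = 2):

```r
# Sketch (built-in mtcars data): with standardized variables, the
# communalities diag(M0 %*% t(M0)) plus the specific variances on the
# diagonal of Sigma_nu_0 reconstruct the unit variances exactly.
X <- scale(mtcars)
pc <- prcomp(X)
r <- 2
M0 <- pc$rotation[, 1:r] %*% diag(pc$sdev[1:r])
communality <- diag(M0 %*% t(M0))
specific <- diag(cov(X)) - communality   # specific (unique) variances
range(communality + specific)            # all equal to 1
```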
# Estimation of M without varimax rotation
MM <- S_y - Sigma_nu_0
MM_eig <- eigen(MM)
MM_values <- MM_eig$values
MM_vectors <- MM_eig$vectors
M_1 <- MM_eig$vectors[,1:r] %*% diag(MM_eig$values[1:r])^(1/2)
M_1
## [,1] [,2] [,3]
## [1,] -0.80986466 -0.369649467 0.16327782
## [2,] -0.91728681 -0.186421820 0.14566568
## [3,] -0.88244859 -0.076442188 -0.02047855
## [4,] -0.33453838 -0.583821992 0.42740464
## [5,] -0.97155623 -0.007703445 -0.04059021
## [6,] -0.89102389 0.106396574 -0.26576796
## [7,] -0.67211144 0.009007158 0.03476317
## [8,] -0.19643762 -0.212739965 -0.70063443
## [9,] -0.06649663 -0.644866889 -0.14944683
## [10,] -0.78695521 0.542982522 -0.05653694
## [11,] 0.25383158 0.607354461 0.15938112
## [12,] 0.81294693 -0.409931274 -0.12999536
## [13,] -0.87767925 0.134498062 -0.01258802
# Final estimation of M and Sigma_nu after varimax rotation for interpretability
M <- varimax(M_1)
M <- loadings(M)[1:p,1:r]
M
## [,1] [,2] [,3]
## [1,] -0.70611902 -0.56387369 0.05122042
## [2,] -0.85279224 -0.40924838 0.05145709
## [3,] -0.83839551 -0.27032346 -0.09484986
## [4,] -0.20429325 -0.70235815 0.31685338
## [5,] -0.94030964 -0.22132060 -0.11163962
## [6,] -0.88023246 -0.05911946 -0.31236535
## [7,] -0.65627156 -0.14885392 -0.01317384
## [8,] -0.11398373 -0.14281393 -0.73576255
## [9,] 0.09251637 -0.61280540 -0.24190849
## [10,] -0.88965830 0.35273277 -0.03758526
## [11,] 0.09786681 0.61764037 0.26010520
## [12,] 0.89082893 -0.19051935 -0.12629866
## [13,] -0.88358029 -0.06749423 -0.05744083
Sigma_nu <- diag(diag(S_y - M %*% t(M)))
Sigma_nu
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.1808189 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [2,] 0.0000000 0.1026133 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [3,] 0.0000000 0.0000000 0.2150217 0.0000000 0.00000000 0.0000000 0.0000000
## [4,] 0.0000000 0.0000000 0.0000000 0.3645612 0.00000000 0.0000000 0.0000000
## [5,] 0.0000000 0.0000000 0.0000000 0.0000000 0.05437158 0.0000000 0.0000000
## [6,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.1241236 0.0000000
## [7,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.5469766
## [8,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [9,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [10,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [11,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [12,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [13,] 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000
## [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [2,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [3,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [4,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [5,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [6,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [7,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [8,] 0.4252654 0.0000000 0.00000000 0.0000000 0.0000000 0.000000
## [9,] 0.0000000 0.5573905 0.00000000 0.0000000 0.0000000 0.000000
## [10,] 0.0000000 0.0000000 0.08267505 0.0000000 0.0000000 0.000000
## [11,] 0.0000000 0.0000000 0.00000000 0.5412877 0.0000000 0.000000
## [12,] 0.0000000 0.0000000 0.00000000 0.0000000 0.1541749 0.000000
## [13,] 0.0000000 0.0000000 0.00000000 0.0000000 0.0000000 0.211431
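Since varimax is an orthogonal rotation, it only redistributes the loadings across factors; the communalities (the row sums of squared loadings) are unchanged. A quick check on the built-in mtcars data (illustration only):

```r
# Sketch (built-in mtcars data): varimax rotation leaves the
# communalities -- the row sums of squared loadings -- unchanged.
p <- prcomp(scale(mtcars))
L <- p$rotation[, 1:3] %*% diag(p$sdev[1:3])  # unrotated loadings
L_rot <- varimax(L)$loadings[, 1:3]           # varimax-rotated loadings
all.equal(rowSums(L^2), rowSums(L_rot^2))     # communalities unchanged
```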
# Understanding the relationship between observable variables and factors
# First factor
plot(1:p,M[,1],pch=19,col=color_1,xlab="",ylab="Loadings",main="Loadings for the first factor")
abline(h=0)
text(1:p,M[,1],labels=colnames(data_pca),pos=1,col=color_2,cex=0.75)
# Second factor
## 4 is height and 9 is compressio_ratio
plot(1:p,M[,2],pch=19,col=color_1,xlab="",ylab="Loadings",main="Loadings for the second factor")
abline(h=0)
text(1:p,M[,2],labels=colnames(data_pca),pos=1,col=color_2,cex=0.75)
# Third factor
# 8 is stroke
plot(1:p,M[,3],pch=19,col=color_1,xlab="",ylab="Loadings",main="Loadings for the third factor")
abline(h=0)
text(1:p,M[,3],labels=colnames(data_pca),pos=1,col=color_2,cex=0.75)
# Communalities
comM <- diag(M %*% t(M))
comM
## [1] 0.8191811 0.8973867 0.7849783 0.6354388 0.9456284 0.8758764 0.4530234
## [8] 0.5747346 0.4426095 0.9173250 0.4587123 0.8458251 0.7885690
names(comM) <- colnames(data_pca)
plot(1:p,sort(comM,decreasing=TRUE),pch=20,col=color_1,xlim=c(0,15),
xlab="Variables",ylab="Communalities",main="Communalities")
text(1:p,sort(comM,decreasing=TRUE),labels=names(sort(comM,decreasing=TRUE)),
pos=4,col=color_2,cex=0.75)
# Estimate the factor scores
Fact <- data_pca %*% solve(Sigma_nu) %*% M %*% solve(t(M) %*% solve(Sigma_nu) %*% M)
colnames(Fact) <- c("Factor 1","Factor 2","Factor 3")
# See that the factors are uncorrelated
pairs(Fact,pch=19,col=color_1)
corrplot(cor(Fact),order="hclust")
# Estimate the residuals
Nu <- data_pca - Fact %*% t(M)
corrplot(cor(Nu),order="hclust")
Looking at the plots that relate the variables to the factors, the first factor represents the smallness of the cars: the cars with the highest values are the least large, least powerful ones with the lowest highway consumption (most miles per gallon). The second factor is an index of power and poor efficiency (compressio_ratio, variable number 9, measures efficiency). The third factor represents tall cars with a short engine stroke.
If we compare the interpretations of the PCA biplots with those of the factor analysis, we find that the relationship between the principal components and the variables is the same as that between the factors and the variables, except that the first elements have opposite signs: while PC1 is associated with large, powerful cars, the first factor is an index of small, less powerful cars.
Then, the communalities plot shows that the variables best explained by the factors are curb_weight, length, horsepower, engine_size, highway_mpg and wheel_base, and clearly the worst explained is compressio_ratio. Afterwards, we compute the estimated factor scores and check in the plots that there is practically no correlation between them. Finally, the representation of the correlation matrix of the residuals shows that some correlations remain that the factor model is not able to explain.
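The score formula used above is Bartlett's weighted least-squares estimator. As a hedged check on the built-in swiss data (with an assumed two-factor model, purely for illustration), it reproduces the scores that stats::factanal returns with scores = "Bartlett":

```r
# Sketch (built-in swiss data): Bartlett factor scores computed from
# F = Z Psi^{-1} M (M' Psi^{-1} M)^{-1} match factanal's output.
fa <- factanal(swiss, factors = 2, scores = "Bartlett")
Z <- scale(swiss)
M_hat <- loadings(fa)[, 1:2]            # rotated loading matrix
Psi_inv <- diag(1 / fa$uniquenesses)    # inverse specific variances
F_manual <- Z %*% Psi_inv %*% M_hat %*% solve(t(M_hat) %*% Psi_inv %*% M_hat)
max(abs(F_manual - fa$scores))          # essentially zero
```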
pairs(p1$x[,1:2],pch=19,main="The first two PCs")
These last two graphs are tricky. When the first component is represented on the x-axis (bottom-left plot), there does not seem to be an obvious differentiation between two groups. On the other hand, when the second component is on the x-axis, there may appear to be a clear distinction between the group on the left, with few observations, and the group on the right, with many more. Hereafter, we explain how the clustering was carried out.
First of all, it has to be pointed out that these methods require standardizing the variables.
When we use the within-cluster sum of squares as the criterion to determine the optimal number of clusters, the result is not clear at all, as there is no evident 'elbow' in the plot.
color_1 <- "deepskyblue2"
color_2 <- "darkorchid4"
color_3 <- "seagreen2"
color_4 <- "orange2"
data_nume<-scale(data_num)
fviz_nbclust(data_nume,kmeans,method="wss",k.max=10)
Therefore, we determine the optimal value of k with the average silhouette. The best option is k = 2, as it presents the highest average, as can be seen in the plot below.
fviz_nbclust(data_nume,kmeans,method="silhouette",k.max=10)
Then we run k-means and color the observations by their respective group. We can observe a cluster on the right with fewer instances than the one on the left. The silhouette plot shows that all observations except one belong to the cluster whose mean is closest to them (all the values are positive). That is a good sign, since a negative value would mean the observation is closer to the other cluster's mean and should belong to it. The average silhouette width for this method is 0.33.
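The silhouette width itself is simple to compute by hand, s(i) = (b(i) - a(i)) / max(a(i), b(i)), and can be checked against cluster::silhouette. A sketch on the built-in iris data, not this report's dataset:

```r
# Sketch (built-in iris data): silhouette widths computed by hand,
# checked against cluster::silhouette.
library(cluster)
X <- scale(iris[, 1:4])
set.seed(1)
km <- kmeans(X, centers = 2, nstart = 25)
D <- as.matrix(dist(X))
n <- nrow(X)
s_manual <- sapply(seq_len(n), function(i) {
  own <- km$cluster[i]
  a <- mean(D[i, km$cluster == own & seq_len(n) != i])  # within-cluster
  b <- min(sapply(setdiff(unique(km$cluster), own),     # nearest other cluster
                  function(g) mean(D[i, km$cluster == g])))
  (b - a) / max(a, b)
})
sil <- silhouette(km$cluster, dist(X))
max(abs(s_manual - sil[, "sil_width"]))  # essentially zero
```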
kmeans_X <- kmeans(data_nume,centers=2,iter.max=1000,nstart=100)
colors_kmeans_X <- c(color_1,color_2)[kmeans_X$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_kmeans_X,xlab="First PC",ylab="Second PC")
sil_kmeans_X <- silhouette(kmeans_X$cluster,dist(data_nume,"euclidean"))
plot(sil_kmeans_X,col=color_1)
Another possibility is k-medoids clustering, which uses medoids (the most central observation of each cluster) instead of means, and the Manhattan (or Gower) distance instead of the Euclidean one. There are different algorithms within the k-medoids method:
PAM (Partitioning Around Medoids) is the standard k-medoids algorithm. The resulting clustering gives a more balanced partition, as can be seen in the first plot. But even though the average silhouette (0.35) is larger than the previous one, this option is worse, as there are more negative values and therefore more wrongly assigned observations.
pam_X <- pam(data_nume,k=2,metric="manhattan",stand=FALSE)
colors_pam_X <- c(color_1,color_2)[pam_X$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_pam_X,xlab="First PC",ylab="Second PC")
sil_pam_X <- silhouette(pam_X$cluster,dist(data_nume,method="manhattan"))
plot(sil_pam_X,col=color_1)
Another option is to use the Gower distance with the PAM algorithm. We found that adding the non-numerical variables as binary variables (through one-hot encoding) introduces a lot of noise and worsens the result, which is why we use just the numeric data.
#Just numeric data
X_Gower <- daisy(data_nume,metric="gower")
summary(X_Gower)
## 20910 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1290 0.1849 0.1983 0.2507 0.6625
## Metric : mixed ; Types = I, I, I, I, I, I, I, I, I, I, I, I, I
## Number of objects : 205
X_Gower_mat <- as.matrix(X_Gower)
X_K <- matrix(NA,nrow=1,ncol=19)
for (i in 1:19){
pam_X_Gower_mat <- pam(X_Gower_mat,k=i+1,diss=TRUE)
X_K[i] <- pam_X_Gower_mat$silinfo$avg.width
}
plot(2:20,X_K,pch=19,col="deepskyblue2",xlab="Number of clusters",ylab="Average silhouette")
pam_X_Gower_mat <- pam(X_Gower_mat,k=2,diss=TRUE)
colors_pam_2 <- c(color_1,color_2)[pam_X_Gower_mat$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_pam_2,xlab="First PC",ylab="Second PC")
sil_pam_X_Gower_mat <- silhouette(pam_X_Gower_mat$cluster,X_Gower_mat)
plot(sil_pam_X_Gower_mat,col=color_1)
summary(sil_pam_X_Gower_mat)
## Silhouette of 205 units in 2 clusters from silhouette.default(x = pam_X_Gower_mat$cluster, dist = X_Gower_mat) :
## Cluster sizes and average silhouette widths:
## 112 93
## 0.4635295 0.2064305
## Individual silhouette widths:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2262 0.2323 0.3676 0.3469 0.5131 0.6225
We get a result similar to the one obtained with the Manhattan distance: there are some negative values in the silhouette graph, so we still prefer the k-means method.
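For reference, daisy() with the Gower coefficient can also take factor columns directly, with no one-hot encoding, and the resulting dissimilarities always lie in [0, 1]. A sketch on the built-in iris data, where Species is a factor:

```r
# Sketch (built-in iris data): daisy() handles mixed numeric / factor
# data directly with the Gower coefficient; dissimilarities lie in [0, 1].
library(cluster)
g <- daisy(iris, metric = "gower")  # Species enters as a factor
range(g)                            # within [0, 1]
```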
data$symboling<-as.factor(data$symboling)
data_cat<-data[,c(1:8,14,15,17)]
data_cat <- one_hot(as.data.table(data_cat))
data_e<-cbind(data_nume,data_cat)
X_Gower <- daisy(data_e,metric="gower")
summary(X_Gower)
## 20910 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000144 0.128190 0.169110 0.169360 0.209430 0.351850
## Metric : mixed ; Types = I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I, I
## Number of objects : 205
X_Gower_mat <- as.matrix(X_Gower)
X_K <- matrix(NA,nrow=1,ncol=19)
for (i in 1:19){
pam_X_Gower_mat <- pam(X_Gower_mat,k=i+1,diss=TRUE)
X_K[i] <- pam_X_Gower_mat$silinfo$avg.width
}
plot(2:20,X_K,pch=19,col="deepskyblue2",xlab="Number of clusters",ylab="Average silhouette")
which.max(X_K)+1
## [1] 20
The aim of CLARA (Clustering LARge Applications) is computational efficiency: it runs PAM on random sub-samples rather than trying to improve on PAM's results. In the silhouette plot the average silhouette is slightly larger than the previous ones, but there are still some negative values, which is why we still prefer the k-means solution.
clara_X <- clara(data_nume,k=2,metric="manhattan",stand=FALSE)
colors_clara_X <- c(color_1,color_2)[clara_X$cluster]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_clara_X,xlab="First PC",ylab="Second PC")
sil_clara_X <- silhouette(clara_X$cluster,dist(data_nume,method="manhattan"))
plot(sil_clara_X,col=color_1)
These methods do not require fixing k, as they explore cluster solutions for every possible value of k. Since it is possible to use mixed data (through the Gower distance), we tried different options to find out whether, in our case, it was better to use only the numeric data or the mixed dataset.
In these methods, the algorithm starts from k = n and then merges clusters based on the distances between them. There are different methods to compute the distance between a newly formed cluster and the remaining ones; these are called 'linkage methods' and are the following:
man_dist_X <- daisy(data_nume,metric="manhattan",stand=F)
gower_dist <- daisy(data_e,metric="gower",stand=F)
## Warning in daisy(data_e, metric = "gower", stand = F): binary variable(s) 14,
## 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
## 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54,
## 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74,
## 75, 76, 77, 78, 79 treated as interval scaled
single_X <- hclust(man_dist_X,method="single")
plot(single_X,main="Single linkage",cex=0.8)
rect.hclust(single_X,k=2,border=color_1)
cl_single_X <- cutree(single_X,2)
colors_single_X <- c(color_1,color_2)[cl_single_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_single_X, xlab="First PC",ylab="Second PC")
sil_single_X <- silhouette(cl_single_X,man_dist_X)
plot(sil_single_X,col=color_1)
single_X <- hclust(gower_dist,method="single")
plot(single_X,main="Single linkage",cex=0.8)
rect.hclust(single_X,k=2,border=color_1)
cl_single_X <- cutree(single_X,2)
colors_single_X <- c(color_1,color_2)[cl_single_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_single_X, xlab="First PC",ylab="Second PC")
sil_single_X <- silhouette(cl_single_X,gower_dist)
plot(sil_single_X,col=color_1)
Normally this linkage method tends to create one large cluster, and this is what we obtained with both approaches. The results are quite poor: the dendrograms do not look right, and the silhouette plots show many wrongly assigned observations (negative values). This was not a good option.
complete_X <- hclust(man_dist_X,method="complete")
plot(complete_X,main="Complete linkage",cex=0.8)
rect.hclust(complete_X,k=2,border=color_1)
cl_complete_X <- cutree(complete_X,2)
colors_complete_X <- c(color_1,color_2)[cl_complete_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_complete_X, xlab="First PC",ylab="Second PC")
sil_complete_X <- silhouette(cl_complete_X,man_dist_X)
plot(sil_complete_X,col=color_1)
In this case the dendrogram looks better. As expected, the groups are somewhat more balanced (though still imbalanced), but the silhouette still presents more negative values than the k-means method.
complete_X <- hclust(gower_dist,method="complete")
plot(complete_X,main="Complete linkage",cex=0.8)
rect.hclust(complete_X,k=2,border=color_1)
cl_complete_X <- cutree(complete_X,2)
colors_complete_X <- c(color_1,color_2)[cl_complete_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_complete_X, xlab="First PC",ylab="Second PC")
sil_complete_X <- silhouette(cl_complete_X,gower_dist)
plot(sil_complete_X,col=color_1)
Using the Gower distance the groups are also imbalanced, as can be seen in the dendrogram, and even with a smaller average silhouette the results are similar to the previous approach. The conclusion is the same: k-means is still the better option.
This method was expected to give an intermediate solution between the single and the complete linkage methods, and this is what we have seen in the plots.
average_X <- hclust(man_dist_X,method="average")
plot(average_X,main="Average linkage",cex=0.8)
rect.hclust(average_X,k=2,border=color_1)
cl_average_X <- cutree(average_X,2)
colors_average_X <- c(color_1,color_2)[cl_average_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_average_X, xlab="First PC",ylab="Second PC")
sil_average_X <- silhouette(cl_average_X,man_dist_X)
plot(sil_average_X,col=color_1)
With this approach we obtain a more imbalanced result, and even though the average silhouette is the largest, there are more negative values, which makes it a worse method.
average_X <- hclust(gower_dist,method="average")
plot(average_X,main="Average linkage",cex=0.8)
rect.hclust(average_X,k=2,border=color_1)
cl_average_X <- cutree(average_X,2)
colors_average_X <- c(color_1,color_2)[cl_average_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_average_X, xlab="First PC",ylab="Second PC")
sil_average_X <- silhouette(cl_average_X,man_dist_X)
plot(sil_average_X,col=color_1)
With the Gower distance and the mixed dataset the clusters are more imbalanced, and the result is even worse as there are more wrongly assigned observations.
ward_X <- hclust(man_dist_X,method="ward.D")
plot(ward_X,main="Ward linkage",cex=0.8)
rect.hclust(ward_X,k=2,border=color_1)
cl_ward_X <- cutree(ward_X,2)
colors_ward_X <- c(color_1,color_2)[cl_ward_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_ward_X,xlab="First PC",ylab="Second PC")
sil_ward_X <- silhouette(cl_ward_X,man_dist_X)
plot(sil_ward_X,col=color_1)
ward_X <- hclust(gower_dist,method="ward.D")
plot(ward_X,main="Ward linkage",cex=0.8)
rect.hclust(ward_X,k=2,border=color_1)
cl_ward_X <- cutree(ward_X,2)
colors_ward_X <- c(color_1,color_2)[cl_ward_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_ward_X,xlab="First PC",ylab="Second PC")
sil_ward_X <- silhouette(cl_ward_X,man_dist_X)
plot(sil_ward_X,col=color_1)
As expected, the Ward linkage method provides results similar to those of k-means. The groups are more balanced than with the other linkage methods, but the Silhouette plots show more observations assigned to clusters they do not belong to than with k-means. Therefore, we still prefer the k-means approach.
An interesting finding in this section is that, in general, the Gower distance yields lower average Silhouettes (which might be because the group means are closer under this distance) and more wrongly assigned observations. Overall, the Manhattan distance finds better solutions.
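The comparison above (average Silhouette under Gower versus Manhattan) can be sketched on a small toy example. The data, column names and cluster labels below are made up for illustration, not taken from the car dataset; the same two-cluster partition is scored under both distances, using dist() for Manhattan on the numeric columns and cluster::daisy() for Gower on the mixed data frame.

```r
# Hedged sketch on toy mixed data: compare the average silhouette width of the
# same 2-cluster partition under the Manhattan and Gower distances.
library(cluster)

set.seed(1)
toy <- data.frame(x = c(rnorm(20, 0), rnorm(20, 4)),
                  y = c(rnorm(20, 0), rnorm(20, 4)),
                  f = factor(rep(c("a", "b"), each = 20)))
cl <- rep(1:2, each = 20)  # the partition being scored

d_man   <- dist(scale(toy[, 1:2]), method = "manhattan")  # numeric columns only
d_gower <- daisy(toy, metric = "gower")                   # handles mixed types

avg_sil <- function(cl, d) mean(silhouette(cl, d)[, "sil_width"])
c(manhattan = avg_sil(cl, d_man), gower = avg_sil(cl, d_gower))
```

Both values lie in [-1, 1]; a higher average silhouette width indicates a better-separated partition, which is how the distances are being compared in this section.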
In Divisive Analysis Clustering (DIANA), all the observations start in a single cluster; at each step the cluster with the largest diameter is divided into two, and so on until every observation forms its own cluster.
diana_X <- diana(data_nume,metric="manhattan")
par(mfrow=c(1,2))
plot(diana_X,main="DIANA")
# The heights here are the diameters of the clusters before splitting
rect.hclust(diana_X,k=2,border=color_1)
cl_diana_X <- cutree(diana_X,2)
colors_diana_X <- c(color_1,color_2)[cl_diana_X]
par(mfrow=c(1,2))
plot(p1$x[,1:2],pch=19,col=colors_diana_X,xlab="First PC",ylab="Second PC")
sil_diana_X <- silhouette(cl_diana_X,man_dist_X)
plot(sil_diana_X,col=color_1)
This approach presents the best results so far. The dendrogram and the PCs plot show that the groups are balanced (as in k-means), and the Silhouette plot contains no negative values, so this is the best solution yet.
This approach uses probabilities rather than distances. Assuming that the observations are generated by different distributions with certain mixing probabilities, each observation is assigned to a cluster according to Bayes' theorem. Mclust is the most widely used model-based clustering method, and it works as follows:
BIC_X <- mclustBIC(p1$x[,1:2],G=1:5)
plot(BIC_X)
Mclust_X <- Mclust(p1$x[,1:2],x=BIC_X)
summary(Mclust_X)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VVI (diagonal, varying volume and shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## -803.0912 205 14 -1680.704 -1712.688
##
## Clustering table:
## 1 2 3
## 45 142 18
# Plot the groups
par(mfrow=c(1,2))
plot(Mclust_X,what="classification")
# Plot the estimated densities
plot(Mclust_X,what="density")
# Plot the estimated probabilities of the observations for each cluster
colors_Mclust_X <- c(color_1,color_2,color_3)[Mclust_X$classification]
par(mfrow=c(2,2))
plot(1:nrow(data_num),Mclust_X$z[,1],pch=19,col=colors_Mclust_X,main="Cluster 1",
     xlab="Car",ylab="Probability of cluster 1")
plot(1:nrow(data_num),Mclust_X$z[,2],pch=19,col=colors_Mclust_X,main="Cluster 2",
     xlab="Car",ylab="Probability of cluster 2")
plot(1:nrow(data_num),Mclust_X$z[,3],pch=19,col=colors_Mclust_X,main="Cluster 3",
     xlab="Car",ylab="Probability of cluster 3")
# Plot the points with uncertainty
#par(mfrow=c(1,1))
plot(Mclust_X,what="uncertainty")
The first plot shows that the best model has 3 clusters with diagonal covariance matrices of varying volume and shape (VVI). The next two graphs show the distribution of the clusters and their densities. The last plots represent the probabilities of belonging to each cluster, together with the observations whose assignment is most uncertain.
sil_mclust_X <- silhouette(Mclust_X$classification,dist(data_nume,method="manhattan"))
plot(sil_mclust_X,col=color_1)
Finally, we computed the Silhouette plot for this method and found that it is not as good as the previous ones: there are more, and larger, negative values, meaning more points are wrongly assigned and with a larger error magnitude.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is an algorithm that clusters using distances (normally Euclidean) between observations. Two parameters have to be set: \(\epsilon\), a measure of closeness, and \(MP\), the minimum number of points needed to form a cluster. In this case we fix \(\epsilon = 1.05\) and \(MP = 10\) because these values give the best results.
minPts <- 10
par(mfrow=c(1,2))
kNNdistplot(p1$x[,1:2],k=minPts-1)
abline(h=1.05,lty=2,lwd=2,col=color_1)
dbscan_Z <- dbscan(p1$x[,1:2],eps=1.05,minPts=minPts)
dbscan_Z
## DBSCAN clustering for 205 objects.
## Parameters: eps = 1.05, minPts = 10
## Using euclidean distances and borderpoints = TRUE
## The clustering contains 1 cluster(s) and 31 noise points.
##
## 0 1
## 31 174
##
## Available fields: cluster, eps, minPts, dist, borderPoints
colors_dbscan_Z <- c(color_1,color_2,color_3,color_4)[dbscan_Z$cluster+1]
plot(p1$x[,1:2],pch=19,col=colors_dbscan_Z,xlab="First PC",ylab="Second PC")
The first plot shows why \(\epsilon = 1.05\): the first elbow of the curve occurs at that point. The second plot shows that, with these parameters, the method forms one big central cluster and labels the remaining observations (colored in blue) as noise.
sil_dbscan_X <- silhouette(dbscan_Z$cluster,dist(data_nume,method="manhattan"))
plot(sil_dbscan_X,col=color_1)
To assess this method we computed the Silhouette plot, and the result is worse than that of the DIANA algorithm: the average Silhouette is similar, but several points do not belong to the cluster they were assigned to, so we still prefer DIANA.
As mentioned before, the best results were obtained with the DIANA method, since its Silhouette plot contained no negative values and therefore no wrong assignments took place.
The supervised classification will be carried out on the fuel_type variable, which is heavily imbalanced, so the results may not be reliable.
Moreover, in this part of the project only the numerical variables will be taken into account, because the categorical values of the 20 diesel cars are almost constant, which can cause problems for the classifiers based on Bayes' theorem. In addition, the KNN classification will use the Euclidean distance, which does not work well with categorical values.
options(digits=4)
color_3 <- "seagreen2"
color_4 <- "indianred2"
# Define the data matrix and the response vector
X <- data_num[,!names(data_num) %in% c("Class")]
Y <- data$fuel_type
c_gas <- sum(Y=="gas")
c_diesel <- sum(Y=="diesel")
c(c_gas,c_diesel)
## [1] 185 20
n <- nrow(data_num)
pr_gas <- c_gas/n
pr_diesel <- c_diesel/n
c(pr_gas,pr_diesel)
## [1] 0.90244 0.09756
An initial graphical analysis is carried out to identify how difficult the classification problem will be.
colors_Y <- c(color_1,color_2)[1*(Y=="diesel")+1]
parcoord(X,col=colors_Y,var.label=F,main="PCP")
In this graph we can see that many of the variables do not distinguish the groups well. However, one variable clearly differentiates the data: compressio_ratio, as we saw in the previous sections.
X_skewness <- apply(X,2,skewness)
sort(X_skewness,decreasing=TRUE)
## compressio_ratio engine_size price horsepower
## 2.59172 1.93337 1.81838 1.36530
## wheel_base width curb_weight highway_mpg
## 1.04251 0.89738 0.67640 0.53604
## length height peak_rpm bore
## 0.15481 0.06266 0.04847 -0.02429
## stroke
## -0.63745
plot(1:ncol(X),apply(X,2,skewness),type="h",pch=19,col=color_1,xlab="Variables",
ylab="Skewness coefficients",main="Skewness coefficients")
We can see that the largest skewness coefficient is 2.59 and the smallest is -0.64, so not all the variables are very skewed.
The variables which are skewed are:
- compressio_ratio
- engine_size
- price
- horsepower
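This selection can be formalized by flagging every variable whose absolute skewness exceeds 1, a common rule of thumb. The sketch below uses simulated columns (not the car data; the column names are illustrative) and a plain moment-based skewness estimate so no extra package is needed:

```r
# Flag variables with |skewness| > 1 (toy data; names are made up)
set.seed(123)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3  # moment-based estimate
X_toy <- data.frame(symmetric = rnorm(500),    # roughly symmetric column
                    right_skewed = rexp(500))  # strongly right-skewed column
sk <- sapply(X_toy, skew)
names(sk)[abs(sk) > 1]  # the skewed variables
```

Applied to the car matrix, the same filter would pick out compressio_ratio, engine_size, price and horsepower, the four variables listed above.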
X_pcs <- prcomp(X,scale=TRUE)
summary(X_pcs)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.589 1.460 1.0661 0.9126 0.8354 0.6462 0.5978 0.4721
## Proportion of Variance 0.516 0.164 0.0874 0.0641 0.0537 0.0321 0.0275 0.0171
## Cumulative Proportion 0.516 0.680 0.7670 0.8311 0.8848 0.9169 0.9444 0.9615
## PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.4210 0.34191 0.30269 0.25779 0.21842
## Proportion of Variance 0.0136 0.00899 0.00705 0.00511 0.00367
## Cumulative Proportion 0.9752 0.98417 0.99122 0.99633 1.00000
Three components are necessary to explain around 77% of the variability, and four components explain around 83% of it.
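The cumulative-variance reasoning above can be written as a small helper. The sketch below uses the built-in mtcars data as a stand-in for the car matrix, so the numbers differ from the summary above:

```r
# How many PCs are needed to reach a target share of the variance?
# (mtcars is a stand-in dataset, not the project's car data.)
pcs <- prcomp(mtcars, scale = TRUE)
cum_var <- cumsum(pcs$sdev^2) / sum(pcs$sdev^2)  # cumulative proportion of variance
n_pcs_for <- function(target) which(cum_var >= target)[1]
n_pcs_for(0.80)  # smallest number of components covering 80% of the variance
```

With the real X matrix, n_pcs_for(0.76) would return 3 and n_pcs_for(0.83) would return 4, matching the summary table.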
# Make a plot of the first two PCs
plot(X_pcs$x[,1:2],pch=20,col=colors_Y,xlim=c(-4.5,5.5),ylim=c(-7,5),
main="First two PCs")
This plot shows two clearly differentiated groups with the first two components.
We divide the data into the train and test partitions.
set.seed(485)
# Obtain a training and a test sample: Take at random the 70% of the observations as the training sample and
# the other 30% of the observations as the test sample
n_train <- floor(.7*n)
n_test <- n - n_train
c(n_train,n_test)
## [1] 143 62
# Obtain the indices of the observations in both data sets at random
i_train <- sort(sample(1:n,n_train))
length(i_train)
## [1] 143
head(i_train)
## [1] 2 3 4 5 6 7
i_test <- setdiff(1:n,i_train)
length(i_test)
## [1] 62
head(i_test)
## [1] 1 8 14 19 20 27
# Obtain the training data matrix and the associated response vector
X_train <- X[i_train,]
Y_train <- Y[i_train]
# Obtain the test data matrix and the associated response vector
X_test <- X[i_test,]
Y_test <- Y[i_test]
# Which are the proportions of diesel and gas cars in the training and test sample
sum(Y_train=="gas")/n_train
## [1] 0.8951
sum(Y_train=="diesel")/n_train
## [1] 0.1049
sum(Y_test=="gas")/n_test
## [1] 0.9194
sum(Y_test=="diesel")/n_test
## [1] 0.08065
# They are similar to the proportions in the original data
The KNN method uses the Euclidean distance, so the data will be scaled in order to obtain the best possible results. First, we compute the LOOCV Error Rate (LER) to estimate the optimal value of k.
# Scale the data
scaled_X_train <- scale(X_train,center=F)
scaled_X_test <- scale(X_test,center=F)
# We take k ranging from 3 to 40 as the sample size is big enough and estimate the LOOCV error rate
LER <- rep(NA,40)
for (i in 3:40){
  knn_output <- knn.cv(scaled_X_train,Y_train,k=i)
  LER[i] <- 1 - mean(knn_output==Y_train)
}
plot(1:40,LER,pch=20,col=color_1,type="b",
     xlab="k",ylab="LER",main="LER")
# Take k as the one that gives the minimum LOOCV error rate
(k <- which.min(LER))
## [1] 3
The optimal value of k is 3.
# Classify the responses in the test data set with the k selected
knn_Y_test <- knn(scaled_X_train,scaled_X_test,Y_train,k=k,prob=T)
head(knn_Y_test)
## [1] gas gas gas gas gas gas
## Levels: diesel gas
# Number of cars classified in each group
summary(knn_Y_test)
## diesel gas
## 5 57
# Confusion table
table(Y_test,knn_Y_test)
## knn_Y_test
## Y_test diesel gas
## diesel 5 0
## gas 0 57
# Obtain the Test Error Rate (TER)
knn_TER <- mean(Y_test!=knn_Y_test)
knn_TER
## [1] 0
# Estimated probabilities of the classifications made for the winner group
prob_knn_Y_test <- attributes(knn_Y_test)$prob
head(prob_knn_Y_test)
## [1] 1 1 1 1 1 1
# Make a plot of the probabilities of the winner group
# In green, good classifications, in red, wrong classifications
colors_errors <- c(color_4,color_3)[1*(Y_test==knn_Y_test)+1]
plot(1:n_test,prob_knn_Y_test,col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Winning probabilities",main="Winning probabilities")
# Note that there have been some ties, and observations at the same distance to
# the observation to be classified have been included
lda_train <- lda(Y_train~.,data=as.data.frame(X_train))
# Estimated prior probabilities for the two groups
lda_train$prior
## diesel gas
## 0.1049 0.8951
# Estimated sample mean vectors
t(lda_train$means)
## diesel gas
## wheel_base 103.427 97.920
## length 180.493 172.629
## width 67.267 65.595
## height 55.560 53.555
## curb_weight 2841.067 2496.672
## engine_size 134.400 123.266
## bore 3.358 3.276
## stroke 3.490 3.220
## compressio_ratio 22.073 8.836
## horsepower 83.067 104.781
## peak_rpm 4443.333 5217.188
## highway_mpg 35.400 30.312
## price 16085.867 12629.531
# Classify the observations in the test sample
lda_test <- predict(lda_train,newdata=as.data.frame(X_test))
# The vector of classifications made can be found here
lda_Y_test <- lda_test$class
lda_Y_test
## [1] gas gas gas gas gas gas gas gas gas gas
## [11] gas gas gas gas gas gas gas gas gas gas
## [21] gas gas gas gas gas gas diesel diesel gas diesel
## [31] gas gas gas gas gas gas gas gas gas gas
## [41] gas gas gas gas gas gas gas gas gas gas
## [51] gas gas gas diesel gas gas gas gas gas gas
## [61] diesel gas
## Levels: diesel gas
# Number of cars classified in each group
table(lda_Y_test)
## lda_Y_test
## diesel gas
## 5 57
# Contingency table with good and bad classifications
table(Y_test,lda_Y_test)
## lda_Y_test
## Y_test diesel gas
## diesel 5 0
## gas 0 57
# Test Error Rate (TER)
lda_TER <- mean(Y_test!=lda_Y_test)
lda_TER
## [1] 0
# Conditional probabilities of the classifications made with the test sample
prob_lda_Y_test <- lda_test$posterior
head(prob_lda_Y_test)
## diesel gas
## 1 2.508e-170 1
## 8 5.058e-170 1
## 14 1.280e-151 1
## 19 1.538e-162 1
## 20 3.739e-160 1
## 27 3.325e-159 1
# In green, good classifications, in red, wrong classifications
colors_errors <- c(color_4,color_3)[1*(Y_test==lda_Y_test)+1]
plot(1:n_test,prob_lda_Y_test[,1],col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of diesel",
     main="Probabilities of diesel (LDA)")
abline(h=0.5)
qda_train <- qda(Y_train~.,data=as.data.frame(X_train))
eigen(cov(X_train[Y_train=="diesel",]))$values
## [1] 7.338e+07 5.905e+04 1.442e+04 1.065e+02 2.971e+01 1.116e+01 8.154e+00
## [8] 1.335e+00 4.828e-01 1.382e-01 3.418e-02 1.219e-03 7.234e-05
summary(X_train[Y_train=="diesel",])
## wheel_base length width height curb_weight
## Min. : 94.5 Min. :165 Min. :63.8 Min. :52.8 Min. :2017
## 1st Qu.: 97.3 1st Qu.:172 1st Qu.:65.5 1st Qu.:54.7 1st Qu.:2297
## Median :102.4 Median :178 Median :66.5 Median :55.5 Median :2579
## Mean :103.4 Mean :180 Mean :67.3 Mean :55.6 Mean :2841
## 3rd Qu.:109.0 3rd Qu.:189 3rd Qu.:69.3 3rd Qu.:56.4 3rd Qu.:3490
## Max. :115.6 Max. :203 Max. :71.7 Max. :58.7 Max. :3770
## engine_size bore stroke compressio_ratio horsepower
## Min. : 97 Min. :2.99 Min. :3.35 Min. :21.0 Min. : 52.0
## 1st Qu.:106 1st Qu.:3.14 1st Qu.:3.40 1st Qu.:21.5 1st Qu.: 60.0
## Median :122 Median :3.39 Median :3.47 Median :22.0 Median : 72.0
## Mean :134 Mean :3.36 Mean :3.49 Mean :22.1 Mean : 83.1
## 3rd Qu.:168 3rd Qu.:3.58 3rd Qu.:3.64 3rd Qu.:22.6 3rd Qu.:109.0
## Max. :183 Max. :3.70 Max. :3.64 Max. :23.0 Max. :123.0
## peak_rpm highway_mpg price
## Min. :4150 Min. :25.0 Min. : 7099
## 1st Qu.:4350 1st Qu.:25.0 1st Qu.: 8696
## Median :4500 Median :36.0 Median :13845
## Mean :4443 Mean :35.4 Mean :16086
## 3rd Qu.:4500 3rd Qu.:42.0 3rd Qu.:21948
## Max. :4800 Max. :50.0 Max. :31600
# The smallest eigenvalues are very small but positive, so the covariance matrix
# of the diesel group is still invertible and QDA can be applied with all the columns
qda_train <- qda(Y_train ~ .,data=as.data.frame(X_train))
# Estimated unconditional probabilities
qda_train$prior
## diesel gas
## 0.1049 0.8951
# Estimated sample mean vectors
t(qda_train$means)
## diesel gas
## wheel_base 103.427 97.920
## length 180.493 172.629
## width 67.267 65.595
## height 55.560 53.555
## curb_weight 2841.067 2496.672
## engine_size 134.400 123.266
## bore 3.358 3.276
## stroke 3.490 3.220
## compressio_ratio 22.073 8.836
## horsepower 83.067 104.781
## peak_rpm 4443.333 5217.188
## highway_mpg 35.400 30.312
## price 16085.867 12629.531
# Classify the observations in the test sample
qda_test <- predict(qda_train,newdata=as.data.frame(X_test))
# The vector of classifications made can be found here
qda_Y_test <- qda_test$class
qda_Y_test
## [1] gas gas gas gas gas gas gas gas gas gas
## [11] gas gas gas gas gas gas gas gas gas gas
## [21] gas gas gas gas gas gas diesel diesel gas diesel
## [31] gas gas gas gas gas gas gas gas gas gas
## [41] gas gas gas gas gas gas gas gas gas gas
## [51] gas gas gas diesel gas gas gas gas gas gas
## [61] diesel gas
## Levels: diesel gas
# Number of cars classified in each group
table(qda_Y_test)
## qda_Y_test
## diesel gas
## 5 57
# Contingency table with good and bad classifications
table(Y_test,qda_Y_test)
## qda_Y_test
## Y_test diesel gas
## diesel 5 0
## gas 0 57
# Test Error Rate (TER)
qda_TER <- mean(Y_test!=qda_Y_test)
qda_TER
## [1] 0
# Conditional probabilities of the classifications made with the test sample
prob_qda_Y_test <- qda_test$posterior
head(prob_qda_Y_test)
## diesel gas
## 1 0 1
## 8 0 1
## 14 0 1
## 19 0 1
## 20 0 1
## 27 0 1
# In green, good classifications, in red, wrong classifications
plot(1:n_test,prob_qda_Y_test[,1],col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of diesel",
     main="Probabilities of diesel (QDA)")
abline(h=0.5)
nb_train <- gaussian_naive_bayes(X_train,Y_train)
# Estimated unconditional probabilities
nb_train$prior
## diesel gas
## 0.1049 0.8951
# Estimated sample mean vectors
t(nb_train$params$mu)
## diesel gas
## wheel_base 103.427 97.920
## length 180.493 172.629
## width 67.267 65.595
## height 55.560 53.555
## curb_weight 2841.067 2496.672
## engine_size 134.400 123.266
## bore 3.358 3.276
## stroke 3.490 3.220
## compressio_ratio 22.073 8.836
## horsepower 83.067 104.781
## peak_rpm 4443.333 5217.188
## highway_mpg 35.400 30.312
## price 16085.867 12629.531
t(nb_train$params$sd)
## diesel gas
## wheel_base 7.0171 5.3508
## length 11.8971 12.0634
## width 2.4956 1.9380
## height 1.6949 2.4858
## curb_weight 624.5451 482.9014
## engine_size 35.0404 40.8650
## bore 0.2618 0.2784
## stroke 0.1218 0.3194
## compressio_ratio 0.7156 0.7199
## horsepower 27.9322 40.2105
## peak_rpm 203.4231 425.2113
## highway_mpg 8.9427 6.6104
## price 8545.4190 7775.8375
# Classify the observations in the test sample
nb_test <- predict(nb_train,newdata=as.matrix(X_test),type="prob")
# The vector of classifications made can be found here
nb_Y_test <- c("diesel","gas")[apply(nb_test,1,which.max)]
nb_Y_test
## [1] "gas" "gas" "gas" "gas" "gas" "gas" "gas" "gas"
## [9] "gas" "gas" "gas" "gas" "gas" "gas" "gas" "gas"
## [17] "gas" "gas" "gas" "gas" "gas" "gas" "gas" "gas"
## [25] "gas" "gas" "diesel" "diesel" "gas" "diesel" "gas" "gas"
## [33] "gas" "gas" "gas" "gas" "gas" "gas" "gas" "gas"
## [41] "gas" "gas" "gas" "gas" "gas" "gas" "gas" "gas"
## [49] "gas" "gas" "gas" "gas" "gas" "diesel" "gas" "gas"
## [57] "gas" "gas" "gas" "gas" "diesel" "gas"
table(nb_Y_test)
## nb_Y_test
## diesel gas
## 5 57
# Contingency table with good and bad classifications
table(Y_test,nb_Y_test)
## nb_Y_test
## Y_test diesel gas
## diesel 5 0
## gas 0 57
# Test Error Rate (TER)
nb_TER <- mean(Y_test!=nb_Y_test)
nb_TER
## [1] 0
# Conditional probabilities of the classifications made with the test sample
prob_nb_Y_test <- nb_test
head(prob_nb_Y_test)
## diesel gas
## [1,] 9.784e-88 1
## [2,] 1.183e-82 1
## [3,] 3.010e-74 1
## [4,] 6.092e-74 1
## [5,] 2.194e-75 1
## [6,] 1.390e-78 1
# In green, good classifications, in red, wrong classifications
plot(1:n_test,prob_nb_Y_test[,1],col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of diesel",
     main="Probabilities of diesel (Naive Bayes)")
abline(h=0.5)
lr_train <- multinom(Y_train~.,data=as.data.frame(X_train))
## # weights: 15 (14 variable)
## initial value 99.120047
## iter 10 value 23.259049
## iter 20 value 0.173415
## iter 30 value 0.000495
## iter 30 value 0.000017
## final value 0.000017
## converged
# Have a look at the estimated coefficients and their standard errors
summary(lr_train)
## Call:
## multinom(formula = Y_train ~ ., data = as.data.frame(X_train))
##
## Coefficients:
## Values Std. Err.
## (Intercept) 1.017e+00 0.2273
## wheel_base 7.134e-01 30.0466
## length -1.196e-01 26.9029
## width 7.348e-01 14.4214
## height -7.451e-01 25.1233
## curb_weight -1.545e-03 1.1909
## engine_size 1.484e-02 36.6575
## bore -2.301e+00 0.5050
## stroke -4.251e+00 1.2341
## compressio_ratio -3.047e+00 14.3096
## horsepower 5.409e-02 31.0416
## peak_rpm -7.352e-04 1.2466
## highway_mpg 3.804e-01 12.1366
## price -8.715e-05 0.1303
##
## Residual Deviance: 3.348e-05
## AIC: 28
# To see which are the most significant coefficients, we use the t-test that is computed next
(t_test_lr_train <- summary(lr_train)$coefficients/summary(lr_train)$standard.errors)
## (Intercept) wheel_base length width
## 4.4723754 0.0237417 -0.0044456 0.0509520
## height curb_weight engine_size bore
## -0.0296563 -0.0012971 0.0004049 -4.5568640
## stroke compressio_ratio horsepower peak_rpm
## -3.4445828 -0.2129239 0.0017426 -0.0005897
## highway_mpg price
## 0.0313466 -0.0006688
# Sort the absolute value of the t-test in decreasing order
sort(abs(t_test_lr_train),decreasing=TRUE)
## bore (Intercept) stroke compressio_ratio
## 4.5568640 4.4723754 3.4445828 0.2129239
## width highway_mpg height wheel_base
## 0.0509520 0.0313466 0.0296563 0.0237417
## length horsepower curb_weight price
## 0.0044456 0.0017426 0.0012971 0.0006688
## peak_rpm engine_size
## 0.0005897 0.0004049
The most relevant variables seem to be bore and stroke. Several variables are not significant, such as engine_size and price.
# Classify the responses in the test data set and estimate the test error rate
lr_test <- predict(lr_train,newdata=as.data.frame(X_test))
head(lr_test)
## [1] gas gas gas gas gas gas
## Levels: diesel gas
# Number of cars classified in each group
summary(lr_test)
## diesel gas
## 5 57
# Table with good and bad classifications
table(Y_test,lr_test)
## lr_test
## Y_test diesel gas
## diesel 5 0
## gas 0 57
# Obtain the Test Error Rate (TER)
lr_TER <- mean(Y_test!=lr_test)
lr_TER
## [1] 0
# Obtain the probabilities of diesel (one minus the probability of gas returned by multinom)
prob_lr_test <- 1 - predict(lr_train,newdata=X_test,type ="probs")
head(prob_lr_test)
## 1 8 14 19 20 27
## 0 0 0 0 0 0
# In green, good classifications, in red, wrong classifications
plot(1:n_test,prob_lr_test,col=colors_errors,pch=20,type="p",
     xlab="Test sample",ylab="Probabilities of diesel",
     main="Probabilities of diesel (Logistic Regression)")
abline(h=0.5)
Looking at the results obtained with the supervised classification, all the methods classify the test-sample observations perfectly, so there would be no difference in using one method or another.
However, these error rates are not reliable, because the response variable used, fuel_type, is heavily imbalanced. Moreover, there are only 20 diesel cars, which is not enough to train and test the models properly. This problem could have been addressed with oversampling, but in this case it did not work well, probably due to the homogeneity of these cars.
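For reference, the oversampling mentioned above can be done in base R by resampling minority-class rows with replacement until both classes have the same size. This is a sketch with made-up labels matching the 185/20 split, not the actual pipeline:

```r
# Simple random oversampling: duplicate minority rows until the classes balance
set.seed(485)
y <- factor(c(rep("gas", 185), rep("diesel", 20)))  # toy labels, 185/20 split
idx_min <- which(y == "diesel")
idx_maj <- which(y == "gas")
extra <- sample(idx_min, length(idx_maj) - length(idx_min), replace = TRUE)
idx_balanced <- c(idx_maj, idx_min, extra)  # row indices of the balanced sample
table(y[idx_balanced])
```

The same index vector would be used to subset the training matrix. Since the 20 diesel cars are very homogeneous, the duplicated rows add little new information, which is consistent with the observation that oversampling did not help here.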